| Title: “IST707 HW5 Use Decision Tree to Solve a Mystery in History” |
| Name: Sathish Kumar Rajendiran |
| Date: 08/12/2020 |
Exercise: Use Decision Tree to Solve a Mystery in History: who wrote the disputed essays, Hamilton or Madison?
In this homework assignment, you are going to use the decision tree algorithm to solve the disputed essay problem. Last week you used clustering techniques to tackle this problem.
Organize your report using the following template:
Section 1: Data preparation. Separate the original data set into training and test data for the classification experiments, and describe which examples are in your training data and which are in your test data.
Section 2: Build and tune decision tree models. First build a DT model using the default settings, then tune the parameters to see whether a better model can be generated. Compare these models using appropriate evaluation measures, and describe and compare the patterns learned by each model.
Section 3: Prediction. After building the classification model, apply it to the disputed papers to determine their authorship. Does the DT model reach the same conclusion as the clustering algorithms did?
# import libraries
# create a function to ensure each library is installed (if needed) and loaded
EnsurePackage <- function(x) {
  x <- as.character(x)
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "http://cran.us.r-project.org")
    require(x, character.only = TRUE)
  }
}
# usage example: load the libraries needed for further processing
EnsurePackage("ggplot2")
Loading required package: ggplot2
EnsurePackage("RColorBrewer")
Loading required package: RColorBrewer
EnsurePackage("NbClust")
Loading required package: NbClust
EnsurePackage("caret")
Loading required package: caret
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
EnsurePackage("rpart")
Loading required package: rpart
EnsurePackage("rpart.plot")
Loading required package: rpart.plot
EnsurePackage("randomForest")
Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
cat("All Packages are available")
All Packages are available
#Load CSV into data frame
filepath <- "/Users/sathishrajendiran/Documents/R/fedPapers85.csv"
fedPapersDF <- data.frame(read.csv(filepath,na.strings=c(""," ","NA")),stringsAsFactors=FALSE)
dim(fedPapersDF) #85 72
[1] 85 72
# Preview the structure
str(fedPapersDF)
'data.frame': 85 obs. of 72 variables:
$ author : Factor w/ 5 levels "dispt","Hamilton",..: 1 1 1 1 1 1 1 1 1 1 ...
$ filename: Factor w/ 85 levels "dispt_fed_49.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
$ a : num 0.28 0.177 0.339 0.27 0.303 0.245 0.349 0.414 0.248 0.442 ...
$ all : num 0.052 0.063 0.09 0.024 0.054 0.059 0.036 0.083 0.04 0.062 ...
$ also : num 0.009 0.013 0.008 0.016 0.027 0.007 0.007 0.009 0.007 0.006 ...
$ an : num 0.096 0.038 0.03 0.024 0.034 0.067 0.029 0.018 0.04 0.075 ...
$ and : num 0.358 0.393 0.301 0.262 0.404 0.282 0.335 0.478 0.356 0.423 ...
$ any : num 0.026 0.063 0.008 0.056 0.04 0.052 0.058 0.046 0.034 0.037 ...
$ are : num 0.131 0.051 0.068 0.064 0.128 0.111 0.087 0.11 0.154 0.093 ...
$ as : num 0.122 0.139 0.203 0.111 0.148 0.252 0.073 0.074 0.161 0.1 ...
$ at : num 0.017 0.114 0.023 0.056 0.013 0.015 0.116 0.037 0.047 0.031 ...
$ be : num 0.411 0.393 0.474 0.365 0.344 0.297 0.378 0.331 0.289 0.379 ...
$ been : num 0.026 0.165 0.015 0.127 0.047 0.03 0.044 0.046 0.027 0.025 ...
$ but : num 0.009 0 0.038 0.032 0.061 0.037 0.007 0.055 0.027 0.037 ...
$ by : num 0.14 0.139 0.173 0.167 0.209 0.186 0.102 0.092 0.168 0.174 ...
$ can : num 0.035 0 0.023 0.056 0.088 0 0.058 0.037 0.047 0.056 ...
$ do : num 0.026 0.013 0 0 0 0 0.015 0.028 0 0 ...
$ down : num 0 0 0.008 0 0 0.007 0 0 0 0 ...
$ even : num 0.009 0.025 0.015 0.024 0.02 0.007 0.007 0.018 0 0.006 ...
$ every : num 0.044 0 0.023 0.04 0.027 0.007 0.087 0.064 0.081 0.05 ...
$ for. : num 0.096 0.076 0.098 0.103 0.141 0.067 0.116 0.055 0.127 0.1 ...
$ from : num 0.044 0.101 0.053 0.079 0.074 0.096 0.08 0.083 0.074 0.124 ...
$ had : num 0.035 0.101 0.008 0.016 0 0.022 0.015 0.009 0.007 0 ...
$ has : num 0.017 0.013 0.015 0.024 0.054 0.015 0.036 0.037 0.02 0.019 ...
$ have : num 0.044 0.152 0.023 0.143 0.047 0.119 0.044 0.074 0.074 0.044 ...
$ her : num 0 0 0 0 0 0 0.007 0 0.034 0.025 ...
$ his : num 0.017 0 0 0.024 0.02 0.067 0 0.018 0.02 0.05 ...
$ if. : num 0 0.025 0.023 0.04 0.034 0.03 0.029 0 0 0.025 ...
$ in. : num 0.262 0.291 0.308 0.238 0.263 0.401 0.189 0.267 0.248 0.274 ...
$ into : num 0.009 0.025 0.038 0.008 0.013 0.037 0 0.037 0.013 0.037 ...
$ is : num 0.157 0.038 0.15 0.151 0.189 0.26 0.167 0.083 0.208 0.23 ...
$ it : num 0.175 0.127 0.173 0.222 0.108 0.156 0.102 0.165 0.134 0.131 ...
$ its : num 0.07 0.038 0.03 0.048 0.013 0.015 0 0.046 0.02 0.019 ...
$ may : num 0.035 0.038 0.12 0.056 0.047 0.074 0.08 0.092 0.027 0.106 ...
$ more : num 0.026 0 0.038 0.056 0.067 0.045 0.08 0.064 0.06 0.081 ...
$ must : num 0.026 0.013 0.083 0.071 0.013 0.015 0.044 0.018 0.027 0.068 ...
$ my : num 0 0 0 0 0 0 0.007 0 0 0 ...
$ no : num 0.035 0 0.03 0.032 0.047 0.059 0.022 0.018 0.02 0.044 ...
$ not : num 0.114 0.127 0.068 0.087 0.128 0.134 0.102 0.101 0.094 0.106 ...
$ now : num 0 0 0 0 0 0 0.007 0 0.007 0.012 ...
$ of : num 0.9 0.747 0.858 0.802 0.869 ...
$ on : num 0.14 0.139 0.15 0.143 0.054 0.141 0.051 0.083 0.127 0.118 ...
$ one : num 0.026 0.025 0.03 0.032 0.047 0.052 0.073 0.046 0.06 0.031 ...
$ only : num 0.035 0 0.023 0.048 0.027 0.022 0.007 0.046 0.02 0.012 ...
$ or : num 0.096 0.114 0.06 0.064 0.081 0.074 0.153 0.037 0.154 0.081 ...
$ our : num 0.017 0 0 0.016 0.027 0.03 0.051 0 0.007 0.025 ...
$ shall : num 0.017 0 0.008 0.016 0 0.015 0.007 0 0.02 0 ...
$ should : num 0.017 0.013 0.068 0.032 0 0.03 0.007 0 0 0.012 ...
$ so : num 0.035 0.013 0.038 0.04 0.027 0.007 0.051 0.018 0.04 0.05 ...
$ some : num 0.009 0.063 0.03 0.024 0.067 0.045 0.007 0.028 0.027 0.025 ...
$ such : num 0.026 0 0.045 0.008 0.027 0.015 0.015 0 0.013 0.031 ...
$ than : num 0.009 0 0.023 0 0.047 0.03 0.109 0.055 0.067 0.044 ...
$ that : num 0.184 0.152 0.188 0.238 0.162 0.208 0.233 0.165 0.208 0.218 ...
$ the : num 1.43 1.25 1.49 1.33 1.19 ...
$ their : num 0.114 0.165 0.053 0.071 0.027 0.089 0.109 0.083 0.154 0.081 ...
$ then : num 0 0 0.015 0.008 0.007 0.007 0.015 0.009 0.007 0.012 ...
$ there : num 0.009 0 0.015 0 0.007 0.007 0.036 0.028 0.02 0 ...
$ things : num 0.009 0 0 0 0 0 0 0 0 0.012 ...
$ this : num 0.044 0.051 0.075 0.103 0.094 0.126 0.08 0.11 0.067 0.093 ...
$ to : num 0.507 0.355 0.361 0.532 0.485 0.445 0.56 0.34 0.49 0.498 ...
$ up : num 0 0 0 0 0 0 0.007 0 0 0 ...
$ upon : num 0 0.013 0 0 0 0 0 0 0 0 ...
$ was : num 0.009 0.051 0.008 0.087 0.027 0.007 0.015 0.018 0.027 0 ...
$ were : num 0.017 0 0.015 0.079 0.02 0.03 0.029 0.009 0.007 0 ...
$ what : num 0 0 0.008 0.008 0.02 0.015 0.015 0.009 0.02 0.025 ...
$ when : num 0.009 0 0 0.024 0.007 0.037 0.007 0 0.02 0.012 ...
$ which : num 0.175 0.114 0.105 0.167 0.155 0.186 0.211 0.175 0.201 0.199 ...
$ who : num 0.044 0.038 0.008 0 0.027 0.045 0.022 0.018 0.04 0.031 ...
$ will : num 0.009 0.089 0.173 0.079 0.168 0.111 0.145 0.267 0.154 0.106 ...
$ with : num 0.087 0.063 0.045 0.079 0.074 0.089 0.073 0.129 0.027 0.081 ...
$ would : num 0.192 0.139 0.068 0.064 0.04 0.037 0.073 0.037 0.04 0.031 ...
$ your : num 0 0 0 0 0 0 0 0 0 0 ...
# Analyze the spread
summary(fedPapersDF)
author filename a all also an and any are
dispt :11 dispt_fed_49.txt: 1 Min. :0.0960 Min. :0.01500 Min. :0.000000 Min. :0.00900 Min. :0.2170 Min. :0.00000 Min. :0.01300
Hamilton:51 dispt_fed_50.txt: 1 1st Qu.:0.2400 1st Qu.:0.03500 1st Qu.:0.000000 1st Qu.:0.04900 1st Qu.:0.3190 1st Qu.:0.02500 1st Qu.:0.05100
HM : 3 dispt_fed_51.txt: 1 Median :0.2990 Median :0.05000 Median :0.007000 Median :0.07100 Median :0.3580 Median :0.04300 Median :0.06800
Jay : 5 dispt_fed_52.txt: 1 Mean :0.2932 Mean :0.05284 Mean :0.007659 Mean :0.06839 Mean :0.3846 Mean :0.04161 Mean :0.07707
Madison :15 dispt_fed_53.txt: 1 3rd Qu.:0.3490 3rd Qu.:0.06600 3rd Qu.:0.013000 3rd Qu.:0.08500 3rd Qu.:0.4130 3rd Qu.:0.05600 3rd Qu.:0.10200
dispt_fed_54.txt: 1 Max. :0.4660 Max. :0.12700 Max. :0.047000 Max. :0.17900 Max. :0.8210 Max. :0.11400 Max. :0.16300
(Other) :79
as at be been but by can do down
Min. :0.0270 Min. :0.00000 Min. :0.0400 Min. :0.00000 Min. :0.00000 Min. :0.0270 Min. :0.00000 Min. :0.000000 Min. :0.000000
1st Qu.:0.1000 1st Qu.:0.02600 1st Qu.:0.2580 1st Qu.:0.03000 1st Qu.:0.02200 1st Qu.:0.0920 1st Qu.:0.01400 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.1240 Median :0.03800 Median :0.3070 Median :0.05300 Median :0.03200 Median :0.1240 Median :0.02900 Median :0.006000 Median :0.000000
Mean :0.1242 Mean :0.04427 Mean :0.3012 Mean :0.05967 Mean :0.03232 Mean :0.1272 Mean :0.03558 Mean :0.006259 Mean :0.001529
3rd Qu.:0.1440 3rd Qu.:0.06300 3rd Qu.:0.3580 3rd Qu.:0.08400 3rd Qu.:0.04200 3rd Qu.:0.1620 3rd Qu.:0.05200 3rd Qu.:0.010000 3rd Qu.:0.000000
Max. :0.2520 Max. :0.11800 Max. :0.4810 Max. :0.16500 Max. :0.08900 Max. :0.2640 Max. :0.11000 Max. :0.028000 Max. :0.017000
even every for. from had has have her
Min. :0.0000 Min. :0.00000 Min. :0.03000 Min. :0.02600 Min. :0.00000 Min. :0.00000 Min. :0.01100 Min. :0.000000
1st Qu.:0.0000 1st Qu.:0.00900 1st Qu.:0.07000 1st Qu.:0.05700 1st Qu.:0.00800 1st Qu.:0.02500 1st Qu.:0.07300 1st Qu.:0.000000
Median :0.0100 Median :0.02200 Median :0.08800 Median :0.07800 Median :0.01600 Median :0.04600 Median :0.09000 Median :0.000000
Mean :0.0114 Mean :0.02391 Mean :0.09376 Mean :0.07978 Mean :0.02116 Mean :0.04442 Mean :0.09474 Mean :0.008094
3rd Qu.:0.0180 3rd Qu.:0.03400 3rd Qu.:0.11400 3rd Qu.:0.09800 3rd Qu.:0.02700 3rd Qu.:0.05700 3rd Qu.:0.12400 3rd Qu.:0.007000
Max. :0.0370 Max. :0.08700 Max. :0.21300 Max. :0.16200 Max. :0.14100 Max. :0.11400 Max. :0.18500 Max. :0.150000
his if. in. into is it its may more
Min. :0.00000 Min. :0.00000 Min. :0.1890 Min. :0.00000 Min. :0.0280 Min. :0.0750 Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.01600 1st Qu.:0.2670 1st Qu.:0.01000 1st Qu.:0.1180 1st Qu.:0.1290 1st Qu.:0.03000 1st Qu.:0.03600 1st Qu.:0.02300
Median :0.01400 Median :0.02600 Median :0.3040 Median :0.02200 Median :0.1510 Median :0.1510 Median :0.04200 Median :0.05600 Median :0.04400
Mean :0.02862 Mean :0.02733 Mean :0.3174 Mean :0.02409 Mean :0.1563 Mean :0.1567 Mean :0.04836 Mean :0.06181 Mean :0.04561
3rd Qu.:0.03900 3rd Qu.:0.03400 3rd Qu.:0.3550 3rd Qu.:0.03400 3rd Qu.:0.1960 3rd Qu.:0.1900 3rd Qu.:0.06400 3rd Qu.:0.08500 3rd Qu.:0.06100
Max. :0.24700 Max. :0.09900 Max. :0.4990 Max. :0.10500 Max. :0.3230 Max. :0.2840 Max. :0.15000 Max. :0.13400 Max. :0.13000
must my no not now of on one
Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.02000 Min. :0.000000 Min. :0.5620 Min. :0.00000 Min. :0.00500
1st Qu.:0.01400 1st Qu.:0.000000 1st Qu.:0.02000 1st Qu.:0.07500 1st Qu.:0.000000 1st Qu.:0.8560 1st Qu.:0.04300 1st Qu.:0.02700
Median :0.02700 Median :0.000000 Median :0.02900 Median :0.09500 Median :0.005000 Median :0.9020 Median :0.06200 Median :0.03600
Mean :0.03305 Mean :0.003259 Mean :0.03236 Mean :0.09248 Mean :0.006035 Mean :0.9094 Mean :0.06926 Mean :0.04079
3rd Qu.:0.04400 3rd Qu.:0.005000 3rd Qu.:0.04300 3rd Qu.:0.11200 3rd Qu.:0.010000 3rd Qu.:0.9690 3rd Qu.:0.09700 3rd Qu.:0.05000
Max. :0.11100 Max. :0.056000 Max. :0.08300 Max. :0.14800 Max. :0.026000 Max. :1.2110 Max. :0.15600 Max. :0.13500
only or our shall should so some such than
Min. :0.00000 Min. :0.02700 Min. :0.000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.01000 1st Qu.:0.07000 1st Qu.:0.000 1st Qu.:0.00600 1st Qu.:0.01000 1st Qu.:0.01800 1st Qu.:0.00900 1st Qu.:0.01800 1st Qu.:0.02700
Median :0.02200 Median :0.08100 Median :0.013 Median :0.01400 Median :0.02700 Median :0.02900 Median :0.01700 Median :0.02900 Median :0.04300
Mean :0.02288 Mean :0.09674 Mean :0.023 Mean :0.01875 Mean :0.02656 Mean :0.02982 Mean :0.01989 Mean :0.02922 Mean :0.04396
3rd Qu.:0.03400 3rd Qu.:0.11600 3rd Qu.:0.028 3rd Qu.:0.02700 3rd Qu.:0.03800 3rd Qu.:0.04000 3rd Qu.:0.02800 3rd Qu.:0.03800 3rd Qu.:0.05500
Max. :0.06500 Max. :0.32100 Max. :0.199 Max. :0.07900 Max. :0.09100 Max. :0.07200 Max. :0.06700 Max. :0.08500 Max. :0.15000
that the their then there things this to up
Min. :0.081 Min. :0.669 Min. :0.00500 Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.00900 Min. :0.3330 Min. :0.000000
1st Qu.:0.171 1st Qu.:1.178 1st Qu.:0.05500 1st Qu.:0.000000 1st Qu.:0.00900 1st Qu.:0.000000 1st Qu.:0.06900 1st Qu.:0.4690 1st Qu.:0.000000
Median :0.208 Median :1.275 Median :0.08600 Median :0.006000 Median :0.02200 Median :0.000000 Median :0.09000 Median :0.5400 Median :0.000000
Mean :0.212 Mean :1.281 Mean :0.08553 Mean :0.006082 Mean :0.02638 Mean :0.002659 Mean :0.08701 Mean :0.5358 Mean :0.003482
3rd Qu.:0.244 3rd Qu.:1.423 3rd Qu.:0.10600 3rd Qu.:0.010000 3rd Qu.:0.03900 3rd Qu.:0.006000 3rd Qu.:0.10500 3rd Qu.:0.6060 3rd Qu.:0.006000
Max. :0.380 Max. :1.803 Max. :0.18300 Max. :0.021000 Max. :0.10500 Max. :0.015000 Max. :0.15300 Max. :0.7760 Max. :0.032000
upon was were what when which who will with
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0810 Min. :0.00000 Min. :0.00600 Min. :0.02700
1st Qu.:0.00000 1st Qu.:0.00900 1st Qu.:0.00700 1st Qu.:0.00500 1st Qu.:0.00000 1st Qu.:0.1180 1st Qu.:0.01600 1st Qu.:0.05200 1st Qu.:0.06100
Median :0.02800 Median :0.01500 Median :0.01500 Median :0.01000 Median :0.00900 Median :0.1520 Median :0.02700 Median :0.08100 Median :0.07900
Mean :0.02922 Mean :0.02584 Mean :0.02022 Mean :0.01286 Mean :0.01174 Mean :0.1578 Mean :0.03253 Mean :0.09865 Mean :0.07968
3rd Qu.:0.05000 3rd Qu.:0.03200 3rd Qu.:0.02900 3rd Qu.:0.02000 3rd Qu.:0.01500 3rd Qu.:0.1830 3rd Qu.:0.04400 3rd Qu.:0.13500 3rd Qu.:0.09200
Max. :0.10200 Max. :0.18900 Max. :0.10800 Max. :0.06000 Max. :0.07300 Max. :0.2760 Max. :0.12900 Max. :0.34000 Max. :0.15000
would your
Min. :0.0090 Min. :0.000000
1st Qu.:0.0420 1st Qu.:0.000000
Median :0.0780 Median :0.000000
Mean :0.1017 Mean :0.002024
3rd Qu.:0.1470 3rd Qu.:0.000000
Max. :0.3820 Max. :0.074000
# Preview top few rows
head(fedPapersDF)
# compare number of articles by authors
x <- data.frame(table(fedPapersDF$author))
coul <- brewer.pal(5, "Set2")
barplot(height=x$Freq, names=x$Var1, col=coul,xlab="Authors",
ylab="Number of Papers",
main="FedPapers85 by Authors",
ylim=c(0,60))
# view the data
View(fedPapersDF)
# Data preparation
# 1. Training Set Preparation
# Prepare dataframe by removing filename from the list
fedPapersDF1 <- subset(fedPapersDF,select=-filename)
fedPapersDF1
set.seed(100)
# split the disputed articles into a separate data frame
fedPapersDF_Dispt <- subset(fedPapersDF1, author=='dispt')
# fedPapersDF_Dispt
# keep the non-disputed (attributed) articles in their own data frame
fedPapersDF_authors <- subset(fedPapersDF1, author!='dispt')
fedPapersDF_authors
# split the non-disputed articles into training and test datasets.
# sample_size sets the number of rows used for training;
# in this case it is defined as 70% of the rows in the dataset
sample_size = floor(0.70*nrow(fedPapersDF_authors)) # split-ratio trials: 65% -> 80% acc | 70% -> 82% | 75% -> 78% | 80% -> 70%
# sample_size # value of the sample size: 51
# set.seed(100) above ensures the same random sample is drawn on every run (seed 324 gave 100% training accuracy)
train_index = sample(seq_len(nrow(fedPapersDF_authors)), size = sample_size)
train_data = fedPapersDF_authors[train_index,]  # training dataset: the rows stored in train_index
test_data = fedPapersDF_authors[-train_index,]  # test dataset: all rows not in train_index
cat("\nArticles by Author:")
Articles by Author:
table(fedPapersDF_authors$author)
dispt Hamilton HM Jay Madison
0 51 3 5 15
cat("\nTrain_data - Articles by Author:")
Train_data - Articles by Author:
table(train_data$author)
dispt Hamilton HM Jay Madison
0 36 2 3 10
cat("\nTest_data - Articles by Author:")
Test_data - Articles by Author:
table(test_data$author)
dispt Hamilton HM Jay Madison
0 15 1 2 5
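One caveat with the plain random sample above: HM has only 3 papers and Jay only 5, so an unlucky draw can leave a class almost entirely on one side of the split. A stratified split samples within each author class instead. A minimal sketch, using a stand-in frame with the same class counts as fedPapers85 (the real code would use fedPapersDF_authors from above):

```r
# Stratified 70/30 split: sample rows within each author class so the
# rare classes (HM: 3 papers, Jay: 5) appear on both sides of the split.
set.seed(100)
df <- data.frame(
  author = factor(rep(c("Hamilton", "Madison", "Jay", "HM"),
                      times = c(51, 15, 5, 3))),
  upon = runif(74)   # stand-in feature column
)

# for each class, take 70% of its row indices for training
strat_index <- unlist(lapply(split(seq_len(nrow(df)), df$author),
                             function(rows) sample(rows, floor(0.70 * length(rows)))))
train_strat <- df[strat_index, ]
test_strat  <- df[-strat_index, ]

table(train_strat$author)  # every author represented in the training set
table(test_strat$author)   # and in the test set
```

caret, which is already loaded, does the same in one call: createDataPartition(fedPapersDF_authors$author, p = 0.70, list = FALSE) returns a stratified index.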
# Section 2: Build and tune decision tree models
# grow a tree with the default rpart settings
rtree <- rpart(author ~ ., data = train_data, method = 'class')
#summarize rtree values
summary(rtree)
Call:
rpart(formula = author ~ ., data = train_data, method = "class")
n= 51
CP nsplit rel error xerror xstd
1 0.60 0 1.0 1.0 0.2169305
2 0.01 1 0.4 0.4 0.1533930
Variable importance
upon there on to an and
26 21 16 16 10 10
Node number 1: 51 observations, complexity param=0.6
predicted class=Hamilton expected loss=0.2941176 P(node) =1
class counts: 0 36 2 3 10
probabilities: 0.000 0.706 0.039 0.059 0.196
left son=2 (35 obs) right son=3 (16 obs)
Primary splits:
upon < 0.0145 to the right, improve=14.497550, (0 missing)
on < 0.0915 to the left, improve=12.450600, (0 missing)
there < 0.0145 to the right, improve=11.162020, (0 missing)
to < 0.499 to the right, improve= 9.435049, (0 missing)
of < 0.8705 to the right, improve= 4.936256, (0 missing)
Surrogate splits:
there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
on < 0.0915 to the left, agree=0.882, adj=0.625, (0 split)
to < 0.474 to the right, agree=0.882, adj=0.625, (0 split)
an < 0.064 to the right, agree=0.804, adj=0.375, (0 split)
and < 0.421 to the left, agree=0.804, adj=0.375, (0 split)
Node number 2: 35 observations
predicted class=Hamilton expected loss=0 P(node) =0.6862745
class counts: 0 35 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 3: 16 observations
predicted class=Madison expected loss=0.375 P(node) =0.3137255
class counts: 0 1 2 3 10
probabilities: 0.000 0.062 0.125 0.188 0.625
plotcp(rtree)  # plot cross-validation results
printcp(rtree) # print cross-validation results
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class")
Variables actually used in tree construction:
[1] upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.60 0 1.0 1.0 0.21693
2 0.01 1 0.4 0.4 0.15339
# plot the decision tree
rpart.plot(rtree, main = "Classification Tree for fedPapers85", extra = 102)
rsq.rpart(rtree) # plot approximate R-squared and relative error for different splits (2 plots)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class")
Variables actually used in tree construction:
[1] upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.60 0 1.0 1.0 0.21693
2 0.01 1 0.4 0.4 0.15339
(warning from rsq.rpart: may not be applicable for this method)
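The cp tables above report cross-validated error on the training data; the models should also be compared on the held-out test set. A minimal sketch of that evaluation pattern, demonstrated on the built-in iris data (the report would substitute rtree and test_data from the chunks above):

```r
library(rpart)

# Held-out evaluation: predict on the test split and tabulate predicted
# vs. actual labels, then read accuracy off the confusion matrix.
set.seed(100)
idx   <- sample(seq_len(nrow(iris)), size = floor(0.70 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit  <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(fit, test, type = "class")

conf_mat <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

caret's confusionMatrix(pred, test$Species) gives the same table plus per-class sensitivity and specificity, which matters here given the small HM and Jay classes.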
# grow tree with cp = 0, minsplit = 0, maxdepth = 5
rtree_0 <- rpart(author ~ ., data = train_data, method = 'class', cp = 0, minsplit = 0, maxdepth = 5)
#summarize rtree values
summary(rtree_0)
Call:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 0, maxdepth = 5)
n= 51
CP nsplit rel error xerror xstd
1 0.60000000 0 1.00000000 1.0000000 0.2169305
2 0.20000000 1 0.40000000 0.4666667 0.1638321
3 0.13333333 2 0.20000000 0.5333333 0.1731422
4 0.06666667 3 0.06666667 0.5333333 0.1731422
5 0.00000000 4 0.00000000 0.4666667 0.1638321
Variable importance
upon an there and on to no which been every if. it not of a
15 13 12 12 10 10 5 5 4 4 2 2 2 2 1
Node number 1: 51 observations, complexity param=0.6
predicted class=Hamilton expected loss=0.2941176 P(node) =1
class counts: 0 36 2 3 10
probabilities: 0.000 0.706 0.039 0.059 0.196
left son=2 (35 obs) right son=3 (16 obs)
Primary splits:
upon < 0.0145 to the right, improve=14.497550, (0 missing)
on < 0.0915 to the left, improve=12.450600, (0 missing)
there < 0.0145 to the right, improve=11.162020, (0 missing)
to < 0.499 to the right, improve= 9.435049, (0 missing)
of < 0.8705 to the right, improve= 4.936256, (0 missing)
Surrogate splits:
there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
on < 0.0915 to the left, agree=0.882, adj=0.625, (0 split)
to < 0.474 to the right, agree=0.882, adj=0.625, (0 split)
an < 0.064 to the right, agree=0.804, adj=0.375, (0 split)
and < 0.421 to the left, agree=0.804, adj=0.375, (0 split)
Node number 2: 35 observations
predicted class=Hamilton expected loss=0 P(node) =0.6862745
class counts: 0 35 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 3: 16 observations, complexity param=0.2
predicted class=Madison expected loss=0.375 P(node) =0.3137255
class counts: 0 1 2 3 10
probabilities: 0.000 0.062 0.125 0.188 0.625
left son=6 (6 obs) right son=7 (10 obs)
Primary splits:
no < 0.021 to the left, improve=5.208333, (0 missing)
an < 0.046 to the left, improve=4.656818, (0 missing)
which < 0.113 to the left, improve=4.256818, (0 missing)
of < 0.7305 to the left, improve=3.951923, (0 missing)
the < 1.019 to the left, improve=3.951923, (0 missing)
Surrogate splits:
an < 0.046 to the left, agree=0.938, adj=0.833, (0 split)
which < 0.113 to the left, agree=0.938, adj=0.833, (0 split)
and < 0.467 to the right, agree=0.875, adj=0.667, (0 split)
been < 0.0275 to the left, agree=0.875, adj=0.667, (0 split)
every < 0.0125 to the left, agree=0.875, adj=0.667, (0 split)
Node number 6: 6 observations, complexity param=0.1333333
predicted class=Jay expected loss=0.5 P(node) =0.1176471
class counts: 0 1 2 3 0
probabilities: 0.000 0.167 0.333 0.500 0.000
left son=12 (3 obs) right son=13 (3 obs)
Primary splits:
an < 0.032 to the right, improve=2.333333, (0 missing)
and < 0.5665 to the left, improve=2.333333, (0 missing)
if. < 0.0175 to the left, improve=2.333333, (0 missing)
it < 0.17 to the left, improve=2.333333, (0 missing)
not < 0.0695 to the left, improve=2.333333, (0 missing)
Surrogate splits:
and < 0.5665 to the left, agree=1, adj=1, (0 split)
if. < 0.0175 to the left, agree=1, adj=1, (0 split)
it < 0.17 to the left, agree=1, adj=1, (0 split)
not < 0.0695 to the left, agree=1, adj=1, (0 split)
of < 0.7305 to the right, agree=1, adj=1, (0 split)
Node number 7: 10 observations
predicted class=Madison expected loss=0 P(node) =0.1960784
class counts: 0 0 0 0 10
probabilities: 0.000 0.000 0.000 0.000 1.000
Node number 12: 3 observations, complexity param=0.06666667
predicted class=HM expected loss=0.3333333 P(node) =0.05882353
class counts: 0 1 2 0 0
probabilities: 0.000 0.333 0.667 0.000 0.000
left son=24 (1 obs) right son=25 (2 obs)
Primary splits:
a < 0.279 to the right, improve=1.333333, (0 missing)
all < 0.0385 to the left, improve=1.333333, (0 missing)
an < 0.0485 to the right, improve=1.333333, (0 missing)
and < 0.4115 to the left, improve=1.333333, (0 missing)
any < 0.0215 to the right, improve=1.333333, (0 missing)
Node number 13: 3 observations
predicted class=Jay expected loss=0 P(node) =0.05882353
class counts: 0 0 0 3 0
probabilities: 0.000 0.000 0.000 1.000 0.000
Node number 24: 1 observations
predicted class=Hamilton expected loss=0 P(node) =0.01960784
class counts: 0 1 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 25: 2 observations
predicted class=HM expected loss=0 P(node) =0.03921569
class counts: 0 0 2 0 0
probabilities: 0.000 0.000 1.000 0.000 0.000
plotcp(rtree_0)  # plot cross-validation results
printcp(rtree_0) # print cross-validation results
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 0, maxdepth = 5)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.46667 0.16383
3 0.133333 2 0.200000 0.53333 0.17314
4 0.066667 3 0.066667 0.53333 0.17314
5 0.000000 4 0.000000 0.46667 0.16383
# plot the decision tree
rpart.plot(rtree_0, main = "Classification Tree for fedPapers85", extra = 102)
rsq.rpart(rtree_0) # plot approximate R-squared and relative error for different splits (2 plots)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 0, maxdepth = 5)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.46667 0.16383
3 0.133333 2 0.200000 0.53333 0.17314
4 0.066667 3 0.066667 0.53333 0.17314
5 0.000000 4 0.000000 0.46667 0.16383
(warning from rsq.rpart: may not be applicable for this method)
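The CP table for rtree_0 shows that the fully grown tree (rel error 0) has a higher cross-validated error (xerror) than the shallower trees, a sign of overfitting. A standard remedy is to prune back to the CP row with the lowest xerror. A sketch of the pattern on the built-in iris data (the same prune call applies to rtree_0 and its cptable above):

```r
library(rpart)

# Overgrow a tree, then prune it back to the complexity-parameter (CP)
# value whose row has the lowest cross-validated error (xerror).
set.seed(100)
fit <- rpart(Species ~ ., data = iris, method = "class",
             cp = 0, minsplit = 2)

best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

printcp(pruned)  # the pruned tree keeps only the splits worth their xerror cost
```

Pruning this way trades a little training accuracy for a model that generalizes better, which is the point of comparing these tuned trees on the test set.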
# grow tree with cp = 0, minsplit = 1, maxdepth = 5
rtree_1 <- rpart(author ~ ., data = train_data, method = 'class', cp = 0, minsplit = 1, maxdepth = 5)
#summarize rtree values
summary(rtree_1)
Call:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 1, maxdepth = 5)
n= 51
CP nsplit rel error xerror xstd
1 0.60000000 0 1.00000000 1.0000000 0.2169305
2 0.20000000 1 0.40000000 0.4000000 0.1533930
3 0.13333333 2 0.20000000 0.4666667 0.1638321
4 0.06666667 3 0.06666667 0.4666667 0.1638321
5 0.00000000 4 0.00000000 0.4000000 0.1533930
Variable importance
upon an there and on to no which been every if. it not of a
15 13 12 12 10 10 5 5 4 4 2 2 2 2 1
Node number 1: 51 observations, complexity param=0.6
predicted class=Hamilton expected loss=0.2941176 P(node) =1
class counts: 0 36 2 3 10
probabilities: 0.000 0.706 0.039 0.059 0.196
left son=2 (35 obs) right son=3 (16 obs)
Primary splits:
upon < 0.0145 to the right, improve=14.497550, (0 missing)
on < 0.0915 to the left, improve=12.450600, (0 missing)
there < 0.0145 to the right, improve=11.162020, (0 missing)
to < 0.499 to the right, improve= 9.435049, (0 missing)
of < 0.8705 to the right, improve= 4.936256, (0 missing)
Surrogate splits:
there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
on < 0.0915 to the left, agree=0.882, adj=0.625, (0 split)
to < 0.474 to the right, agree=0.882, adj=0.625, (0 split)
an < 0.064 to the right, agree=0.804, adj=0.375, (0 split)
and < 0.421 to the left, agree=0.804, adj=0.375, (0 split)
Node number 2: 35 observations
predicted class=Hamilton expected loss=0 P(node) =0.6862745
class counts: 0 35 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 3: 16 observations, complexity param=0.2
predicted class=Madison expected loss=0.375 P(node) =0.3137255
class counts: 0 1 2 3 10
probabilities: 0.000 0.062 0.125 0.188 0.625
left son=6 (6 obs) right son=7 (10 obs)
Primary splits:
no < 0.021 to the left, improve=5.208333, (0 missing)
an < 0.046 to the left, improve=4.656818, (0 missing)
which < 0.113 to the left, improve=4.256818, (0 missing)
of < 0.7305 to the left, improve=3.951923, (0 missing)
the < 1.019 to the left, improve=3.951923, (0 missing)
Surrogate splits:
an < 0.046 to the left, agree=0.938, adj=0.833, (0 split)
which < 0.113 to the left, agree=0.938, adj=0.833, (0 split)
and < 0.467 to the right, agree=0.875, adj=0.667, (0 split)
been < 0.0275 to the left, agree=0.875, adj=0.667, (0 split)
every < 0.0125 to the left, agree=0.875, adj=0.667, (0 split)
Node number 6: 6 observations, complexity param=0.1333333
predicted class=Jay expected loss=0.5 P(node) =0.1176471
class counts: 0 1 2 3 0
probabilities: 0.000 0.167 0.333 0.500 0.000
left son=12 (3 obs) right son=13 (3 obs)
Primary splits:
an < 0.032 to the right, improve=2.333333, (0 missing)
and < 0.5665 to the left, improve=2.333333, (0 missing)
if. < 0.0175 to the left, improve=2.333333, (0 missing)
it < 0.17 to the left, improve=2.333333, (0 missing)
not < 0.0695 to the left, improve=2.333333, (0 missing)
Surrogate splits:
and < 0.5665 to the left, agree=1, adj=1, (0 split)
if. < 0.0175 to the left, agree=1, adj=1, (0 split)
it < 0.17 to the left, agree=1, adj=1, (0 split)
not < 0.0695 to the left, agree=1, adj=1, (0 split)
of < 0.7305 to the right, agree=1, adj=1, (0 split)
Node number 7: 10 observations
predicted class=Madison expected loss=0 P(node) =0.1960784
class counts: 0 0 0 0 10
probabilities: 0.000 0.000 0.000 0.000 1.000
Node number 12: 3 observations, complexity param=0.06666667
predicted class=HM expected loss=0.3333333 P(node) =0.05882353
class counts: 0 1 2 0 0
probabilities: 0.000 0.333 0.667 0.000 0.000
left son=24 (1 obs) right son=25 (2 obs)
Primary splits:
a < 0.279 to the right, improve=1.333333, (0 missing)
all < 0.0385 to the left, improve=1.333333, (0 missing)
an < 0.0485 to the right, improve=1.333333, (0 missing)
and < 0.4115 to the left, improve=1.333333, (0 missing)
any < 0.0215 to the right, improve=1.333333, (0 missing)
Node number 13: 3 observations
predicted class=Jay expected loss=0 P(node) =0.05882353
class counts: 0 0 0 3 0
probabilities: 0.000 0.000 0.000 1.000 0.000
Node number 24: 1 observations
predicted class=Hamilton expected loss=0 P(node) =0.01960784
class counts: 0 1 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 25: 2 observations
predicted class=HM expected loss=0 P(node) =0.03921569
class counts: 0 0 2 0 0
probabilities: 0.000 0.000 1.000 0.000 0.000
plotcp(rtree_1)  # plot cross-validation results
printcp(rtree_1) # print cross-validation results
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 1, maxdepth = 5)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.40000 0.15339
3 0.133333 2 0.200000 0.46667 0.16383
4 0.066667 3 0.066667 0.46667 0.16383
5 0.000000 4 0.000000 0.40000 0.15339
# plot the decision tree
rpart.plot(rtree_1, main = "Classification Tree for fedPapers85", extra = 102)
rsq.rpart(rtree_1) # plot approximate R-squared and relative error for different splits (2 plots)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 1, maxdepth = 5)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.40000 0.15339
3 0.133333 2 0.200000 0.46667 0.16383
4 0.066667 3 0.066667 0.46667 0.16383
5 0.000000 4 0.000000 0.40000 0.15339
(warning from rsq.rpart: may not be applicable for this method)
# grow tree with cp = 0, minsplit = 2, maxdepth = 10
rtree_2 <- rpart(author ~ ., data = train_data, method = 'class', cp = 0, minsplit = 2, maxdepth = 10)
#summarize rtree values
summary(rtree_2)
Call:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 2, maxdepth = 10)
n= 51
CP nsplit rel error xerror xstd
1 0.60000000 0 1.00000000 1.0000000 0.2169305
2 0.20000000 1 0.40000000 0.4000000 0.1533930
3 0.13333333 2 0.20000000 0.4666667 0.1638321
4 0.06666667 3 0.06666667 0.4666667 0.1638321
5 0.00000000 4 0.00000000 0.4666667 0.1638321
Variable importance
upon an there and on to no which been every if. it not of a
15 13 12 12 10 10 5 5 4 4 2 2 2 2 1
Node number 1: 51 observations, complexity param=0.6
predicted class=Hamilton expected loss=0.2941176 P(node) =1
class counts: 0 36 2 3 10
probabilities: 0.000 0.706 0.039 0.059 0.196
left son=2 (35 obs) right son=3 (16 obs)
Primary splits:
upon < 0.0145 to the right, improve=14.497550, (0 missing)
on < 0.0915 to the left, improve=12.450600, (0 missing)
there < 0.0145 to the right, improve=11.162020, (0 missing)
to < 0.499 to the right, improve= 9.435049, (0 missing)
of < 0.8705 to the right, improve= 4.936256, (0 missing)
Surrogate splits:
there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
on < 0.0915 to the left, agree=0.882, adj=0.625, (0 split)
to < 0.474 to the right, agree=0.882, adj=0.625, (0 split)
an < 0.064 to the right, agree=0.804, adj=0.375, (0 split)
and < 0.421 to the left, agree=0.804, adj=0.375, (0 split)
Node number 2: 35 observations
predicted class=Hamilton expected loss=0 P(node) =0.6862745
class counts: 0 35 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 3: 16 observations, complexity param=0.2
predicted class=Madison expected loss=0.375 P(node) =0.3137255
class counts: 0 1 2 3 10
probabilities: 0.000 0.062 0.125 0.188 0.625
left son=6 (6 obs) right son=7 (10 obs)
Primary splits:
no < 0.021 to the left, improve=5.208333, (0 missing)
an < 0.046 to the left, improve=4.656818, (0 missing)
which < 0.113 to the left, improve=4.256818, (0 missing)
of < 0.7305 to the left, improve=3.951923, (0 missing)
the < 1.019 to the left, improve=3.951923, (0 missing)
Surrogate splits:
an < 0.046 to the left, agree=0.938, adj=0.833, (0 split)
which < 0.113 to the left, agree=0.938, adj=0.833, (0 split)
and < 0.467 to the right, agree=0.875, adj=0.667, (0 split)
been < 0.0275 to the left, agree=0.875, adj=0.667, (0 split)
every < 0.0125 to the left, agree=0.875, adj=0.667, (0 split)
Node number 6: 6 observations, complexity param=0.1333333
predicted class=Jay expected loss=0.5 P(node) =0.1176471
class counts: 0 1 2 3 0
probabilities: 0.000 0.167 0.333 0.500 0.000
left son=12 (3 obs) right son=13 (3 obs)
Primary splits:
an < 0.032 to the right, improve=2.333333, (0 missing)
and < 0.5665 to the left, improve=2.333333, (0 missing)
if. < 0.0175 to the left, improve=2.333333, (0 missing)
it < 0.17 to the left, improve=2.333333, (0 missing)
not < 0.0695 to the left, improve=2.333333, (0 missing)
Surrogate splits:
and < 0.5665 to the left, agree=1, adj=1, (0 split)
if. < 0.0175 to the left, agree=1, adj=1, (0 split)
it < 0.17 to the left, agree=1, adj=1, (0 split)
not < 0.0695 to the left, agree=1, adj=1, (0 split)
of < 0.7305 to the right, agree=1, adj=1, (0 split)
Node number 7: 10 observations
predicted class=Madison expected loss=0 P(node) =0.1960784
class counts: 0 0 0 0 10
probabilities: 0.000 0.000 0.000 0.000 1.000
Node number 12: 3 observations, complexity param=0.06666667
predicted class=HM expected loss=0.3333333 P(node) =0.05882353
class counts: 0 1 2 0 0
probabilities: 0.000 0.333 0.667 0.000 0.000
left son=24 (1 obs) right son=25 (2 obs)
Primary splits:
a < 0.279 to the right, improve=1.333333, (0 missing)
all < 0.0385 to the left, improve=1.333333, (0 missing)
an < 0.0485 to the right, improve=1.333333, (0 missing)
and < 0.4115 to the left, improve=1.333333, (0 missing)
any < 0.0215 to the right, improve=1.333333, (0 missing)
Node number 13: 3 observations
predicted class=Jay expected loss=0 P(node) =0.05882353
class counts: 0 0 0 3 0
probabilities: 0.000 0.000 0.000 1.000 0.000
Node number 24: 1 observations
predicted class=Hamilton expected loss=0 P(node) =0.01960784
class counts: 0 1 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 25: 2 observations
predicted class=HM expected loss=0 P(node) =0.03921569
class counts: 0 0 2 0 0
probabilities: 0.000 0.000 1.000 0.000 0.000
plotcp(rtree_2) # plot cross-validation results
printcp(rtree_2) # print cross-validation results (cp table)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 2, maxdepth = 10)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.40000 0.15339
3 0.133333 2 0.200000 0.46667 0.16383
4 0.066667 3 0.066667 0.46667 0.16383
5 0.000000 4 0.000000 0.46667 0.16383
# plot the decision tree
rpart.plot(rtree_2, main = "Classification Tree for fedPapers85", extra = 102)
rsq.rpart(rtree_2) # plot approximate R-squared and relative error for different splits (2 plots)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 2, maxdepth = 10)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.40000 0.15339
3 0.133333 2 0.200000 0.46667 0.16383
4 0.066667 3 0.066667 0.46667 0.16383
5 0.000000 4 0.000000 0.46667 0.16383
Warning message: In rsq.rpart(rtree_2) : may not be applicable for this method
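The cp tables above are read off by eye; rpart can also pick the pruning level automatically. A minimal, self-contained sketch of that workflow, using the built-in iris data as a stand-in since train_data is not reproduced in this chunk:

```r
# Sketch: grow a full tree, then prune at the cp value that minimizes
# cross-validated error (the xerror column of the cp table).
# iris is only a stand-in for the fedPapers85 training data.
library(rpart)

set.seed(42)
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   cp = 0, minsplit = 2)

# cptable columns: CP, nsplit, rel error, xerror, xstd
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(full_tree, cp = best_cp)

printcp(pruned) # the pruned tree keeps only the splits worth their cp cost
```

The same two lines (which.min over xerror, then prune) would apply unchanged to rtree_2 through rtree_10.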
# grow tree with cp = 0, minsplit = 3, maxdepth = 5
rtree_3 <- rpart(author ~ ., data = train_data, method = 'class', cp = 0, minsplit = 3, maxdepth = 5)
#summarize rtree values
summary(rtree_3)
Call:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 3, maxdepth = 5)
n= 51
CP nsplit rel error xerror xstd
1 0.60000000 0 1.00000000 1.0000000 0.2169305
2 0.20000000 1 0.40000000 0.4000000 0.1533930
3 0.13333333 2 0.20000000 0.4000000 0.1533930
4 0.06666667 3 0.06666667 0.4000000 0.1533930
5 0.00000000 4 0.00000000 0.3333333 0.1415753
Variable importance
upon an there and on to no which been every if. it not of a
15 13 12 12 10 10 5 5 4 4 2 2 2 2 1
Node number 1: 51 observations, complexity param=0.6
predicted class=Hamilton expected loss=0.2941176 P(node) =1
class counts: 0 36 2 3 10
probabilities: 0.000 0.706 0.039 0.059 0.196
left son=2 (35 obs) right son=3 (16 obs)
Primary splits:
upon < 0.0145 to the right, improve=14.497550, (0 missing)
on < 0.0915 to the left, improve=12.450600, (0 missing)
there < 0.0145 to the right, improve=11.162020, (0 missing)
to < 0.499 to the right, improve= 9.435049, (0 missing)
of < 0.8705 to the right, improve= 4.936256, (0 missing)
Surrogate splits:
there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
on < 0.0915 to the left, agree=0.882, adj=0.625, (0 split)
to < 0.474 to the right, agree=0.882, adj=0.625, (0 split)
an < 0.064 to the right, agree=0.804, adj=0.375, (0 split)
and < 0.421 to the left, agree=0.804, adj=0.375, (0 split)
Node number 2: 35 observations
predicted class=Hamilton expected loss=0 P(node) =0.6862745
class counts: 0 35 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 3: 16 observations, complexity param=0.2
predicted class=Madison expected loss=0.375 P(node) =0.3137255
class counts: 0 1 2 3 10
probabilities: 0.000 0.062 0.125 0.188 0.625
left son=6 (6 obs) right son=7 (10 obs)
Primary splits:
no < 0.021 to the left, improve=5.208333, (0 missing)
an < 0.046 to the left, improve=4.656818, (0 missing)
which < 0.113 to the left, improve=4.256818, (0 missing)
of < 0.7305 to the left, improve=3.951923, (0 missing)
the < 1.019 to the left, improve=3.951923, (0 missing)
Surrogate splits:
an < 0.046 to the left, agree=0.938, adj=0.833, (0 split)
which < 0.113 to the left, agree=0.938, adj=0.833, (0 split)
and < 0.467 to the right, agree=0.875, adj=0.667, (0 split)
been < 0.0275 to the left, agree=0.875, adj=0.667, (0 split)
every < 0.0125 to the left, agree=0.875, adj=0.667, (0 split)
Node number 6: 6 observations, complexity param=0.1333333
predicted class=Jay expected loss=0.5 P(node) =0.1176471
class counts: 0 1 2 3 0
probabilities: 0.000 0.167 0.333 0.500 0.000
left son=12 (3 obs) right son=13 (3 obs)
Primary splits:
an < 0.032 to the right, improve=2.333333, (0 missing)
and < 0.5665 to the left, improve=2.333333, (0 missing)
if. < 0.0175 to the left, improve=2.333333, (0 missing)
it < 0.17 to the left, improve=2.333333, (0 missing)
not < 0.0695 to the left, improve=2.333333, (0 missing)
Surrogate splits:
and < 0.5665 to the left, agree=1, adj=1, (0 split)
if. < 0.0175 to the left, agree=1, adj=1, (0 split)
it < 0.17 to the left, agree=1, adj=1, (0 split)
not < 0.0695 to the left, agree=1, adj=1, (0 split)
of < 0.7305 to the right, agree=1, adj=1, (0 split)
Node number 7: 10 observations
predicted class=Madison expected loss=0 P(node) =0.1960784
class counts: 0 0 0 0 10
probabilities: 0.000 0.000 0.000 0.000 1.000
Node number 12: 3 observations, complexity param=0.06666667
predicted class=HM expected loss=0.3333333 P(node) =0.05882353
class counts: 0 1 2 0 0
probabilities: 0.000 0.333 0.667 0.000 0.000
left son=24 (1 obs) right son=25 (2 obs)
Primary splits:
a < 0.279 to the right, improve=1.333333, (0 missing)
all < 0.0385 to the left, improve=1.333333, (0 missing)
an < 0.0485 to the right, improve=1.333333, (0 missing)
and < 0.4115 to the left, improve=1.333333, (0 missing)
any < 0.0215 to the right, improve=1.333333, (0 missing)
Node number 13: 3 observations
predicted class=Jay expected loss=0 P(node) =0.05882353
class counts: 0 0 0 3 0
probabilities: 0.000 0.000 0.000 1.000 0.000
Node number 24: 1 observations
predicted class=Hamilton expected loss=0 P(node) =0.01960784
class counts: 0 1 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 25: 2 observations
predicted class=HM expected loss=0 P(node) =0.03921569
class counts: 0 0 2 0 0
probabilities: 0.000 0.000 1.000 0.000 0.000
plotcp(rtree_3) # plot cross-validation results
printcp(rtree_3) # print cross-validation results (cp table)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 3, maxdepth = 5)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.40000 0.15339
3 0.133333 2 0.200000 0.40000 0.15339
4 0.066667 3 0.066667 0.40000 0.15339
5 0.000000 4 0.000000 0.33333 0.14158
# plot the decision tree
rpart.plot(rtree_3, main = "Classification Tree for fedPapers85", extra = 102)
rsq.rpart(rtree_3) # plot approximate R-squared and relative error for different splits (2 plots)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 3, maxdepth = 5)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.40000 0.15339
3 0.133333 2 0.200000 0.40000 0.15339
4 0.066667 3 0.066667 0.40000 0.15339
5 0.000000 4 0.000000 0.33333 0.14158
Warning message: In rsq.rpart(rtree_3) : may not be applicable for this method
# grow tree with cp = 0, minsplit = 4, maxdepth = 5
rtree_4 <- rpart(author ~ ., data = train_data, method = 'class', cp = 0, minsplit = 4, maxdepth = 5)
#summarize rtree values
summary(rtree_4)
Call:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 4, maxdepth = 5)
n= 51
CP nsplit rel error xerror xstd
1 0.6000000 0 1.00000000 1.0000000 0.2169305
2 0.2000000 1 0.40000000 0.4000000 0.1533930
3 0.1333333 2 0.20000000 0.4666667 0.1638321
4 0.0000000 3 0.06666667 0.4000000 0.1533930
Variable importance
upon an there and on to no which been every if. it not of
15 13 13 12 10 10 6 5 4 4 2 2 2 2
Node number 1: 51 observations, complexity param=0.6
predicted class=Hamilton expected loss=0.2941176 P(node) =1
class counts: 0 36 2 3 10
probabilities: 0.000 0.706 0.039 0.059 0.196
left son=2 (35 obs) right son=3 (16 obs)
Primary splits:
upon < 0.0145 to the right, improve=14.497550, (0 missing)
on < 0.0915 to the left, improve=12.450600, (0 missing)
there < 0.0145 to the right, improve=11.162020, (0 missing)
to < 0.499 to the right, improve= 9.435049, (0 missing)
of < 0.8705 to the right, improve= 4.936256, (0 missing)
Surrogate splits:
there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
on < 0.0915 to the left, agree=0.882, adj=0.625, (0 split)
to < 0.474 to the right, agree=0.882, adj=0.625, (0 split)
an < 0.064 to the right, agree=0.804, adj=0.375, (0 split)
and < 0.421 to the left, agree=0.804, adj=0.375, (0 split)
Node number 2: 35 observations
predicted class=Hamilton expected loss=0 P(node) =0.6862745
class counts: 0 35 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 3: 16 observations, complexity param=0.2
predicted class=Madison expected loss=0.375 P(node) =0.3137255
class counts: 0 1 2 3 10
probabilities: 0.000 0.062 0.125 0.188 0.625
left son=6 (6 obs) right son=7 (10 obs)
Primary splits:
no < 0.021 to the left, improve=5.208333, (0 missing)
an < 0.046 to the left, improve=4.656818, (0 missing)
which < 0.113 to the left, improve=4.256818, (0 missing)
of < 0.7305 to the left, improve=3.951923, (0 missing)
the < 1.019 to the left, improve=3.951923, (0 missing)
Surrogate splits:
an < 0.046 to the left, agree=0.938, adj=0.833, (0 split)
which < 0.113 to the left, agree=0.938, adj=0.833, (0 split)
and < 0.467 to the right, agree=0.875, adj=0.667, (0 split)
been < 0.0275 to the left, agree=0.875, adj=0.667, (0 split)
every < 0.0125 to the left, agree=0.875, adj=0.667, (0 split)
Node number 6: 6 observations, complexity param=0.1333333
predicted class=Jay expected loss=0.5 P(node) =0.1176471
class counts: 0 1 2 3 0
probabilities: 0.000 0.167 0.333 0.500 0.000
left son=12 (3 obs) right son=13 (3 obs)
Primary splits:
an < 0.032 to the right, improve=2.333333, (0 missing)
and < 0.5665 to the left, improve=2.333333, (0 missing)
if. < 0.0175 to the left, improve=2.333333, (0 missing)
it < 0.17 to the left, improve=2.333333, (0 missing)
not < 0.0695 to the left, improve=2.333333, (0 missing)
Surrogate splits:
and < 0.5665 to the left, agree=1, adj=1, (0 split)
if. < 0.0175 to the left, agree=1, adj=1, (0 split)
it < 0.17 to the left, agree=1, adj=1, (0 split)
not < 0.0695 to the left, agree=1, adj=1, (0 split)
of < 0.7305 to the right, agree=1, adj=1, (0 split)
Node number 7: 10 observations
predicted class=Madison expected loss=0 P(node) =0.1960784
class counts: 0 0 0 0 10
probabilities: 0.000 0.000 0.000 0.000 1.000
Node number 12: 3 observations
predicted class=HM expected loss=0.3333333 P(node) =0.05882353
class counts: 0 1 2 0 0
probabilities: 0.000 0.333 0.667 0.000 0.000
Node number 13: 3 observations
predicted class=Jay expected loss=0 P(node) =0.05882353
class counts: 0 0 0 3 0
probabilities: 0.000 0.000 0.000 1.000 0.000
plotcp(rtree_4) # plot cross-validation results
printcp(rtree_4) # print cross-validation results (cp table)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 4, maxdepth = 5)
Variables actually used in tree construction:
[1] an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.60000 0 1.000000 1.00000 0.21693
2 0.20000 1 0.400000 0.40000 0.15339
3 0.13333 2 0.200000 0.46667 0.16383
4 0.00000 3 0.066667 0.40000 0.15339
# plot the decision tree
rpart.plot(rtree_4, main = "Classification Tree for fedPapers85", extra = 102)
rsq.rpart(rtree_4) # plot approximate R-squared and relative error for different splits (2 plots)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 4, maxdepth = 5)
Variables actually used in tree construction:
[1] an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.60000 0 1.000000 1.00000 0.21693
2 0.20000 1 0.400000 0.40000 0.15339
3 0.13333 2 0.200000 0.46667 0.16383
4 0.00000 3 0.066667 0.40000 0.15339
Warning message: In rsq.rpart(rtree_4) : may not be applicable for this method
# grow tree with cp = 0, minsplit = 3, maxdepth = 5, minbucket = 1
rtree_5 <- rpart(author ~ ., data = train_data, method = 'class', cp = 0, minsplit = 3, maxdepth = 5, minbucket = 1)
#summarize rtree values
summary(rtree_5)
Call:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 3, maxdepth = 5, minbucket = 1)
n= 51
CP nsplit rel error xerror xstd
1 0.60000000 0 1.00000000 1.0000000 0.2169305
2 0.20000000 1 0.40000000 0.4000000 0.1533930
3 0.13333333 2 0.20000000 0.5333333 0.1731422
4 0.06666667 3 0.06666667 0.6000000 0.1814970
5 0.00000000 4 0.00000000 0.5333333 0.1731422
Variable importance
upon an there and on to no which been every if. it not of a
15 13 12 12 10 10 5 5 4 4 2 2 2 2 1
Node number 1: 51 observations, complexity param=0.6
predicted class=Hamilton expected loss=0.2941176 P(node) =1
class counts: 0 36 2 3 10
probabilities: 0.000 0.706 0.039 0.059 0.196
left son=2 (35 obs) right son=3 (16 obs)
Primary splits:
upon < 0.0145 to the right, improve=14.497550, (0 missing)
on < 0.0915 to the left, improve=12.450600, (0 missing)
there < 0.0145 to the right, improve=11.162020, (0 missing)
to < 0.499 to the right, improve= 9.435049, (0 missing)
of < 0.8705 to the right, improve= 4.936256, (0 missing)
Surrogate splits:
there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
on < 0.0915 to the left, agree=0.882, adj=0.625, (0 split)
to < 0.474 to the right, agree=0.882, adj=0.625, (0 split)
an < 0.064 to the right, agree=0.804, adj=0.375, (0 split)
and < 0.421 to the left, agree=0.804, adj=0.375, (0 split)
Node number 2: 35 observations
predicted class=Hamilton expected loss=0 P(node) =0.6862745
class counts: 0 35 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 3: 16 observations, complexity param=0.2
predicted class=Madison expected loss=0.375 P(node) =0.3137255
class counts: 0 1 2 3 10
probabilities: 0.000 0.062 0.125 0.188 0.625
left son=6 (6 obs) right son=7 (10 obs)
Primary splits:
no < 0.021 to the left, improve=5.208333, (0 missing)
an < 0.046 to the left, improve=4.656818, (0 missing)
which < 0.113 to the left, improve=4.256818, (0 missing)
of < 0.7305 to the left, improve=3.951923, (0 missing)
the < 1.019 to the left, improve=3.951923, (0 missing)
Surrogate splits:
an < 0.046 to the left, agree=0.938, adj=0.833, (0 split)
which < 0.113 to the left, agree=0.938, adj=0.833, (0 split)
and < 0.467 to the right, agree=0.875, adj=0.667, (0 split)
been < 0.0275 to the left, agree=0.875, adj=0.667, (0 split)
every < 0.0125 to the left, agree=0.875, adj=0.667, (0 split)
Node number 6: 6 observations, complexity param=0.1333333
predicted class=Jay expected loss=0.5 P(node) =0.1176471
class counts: 0 1 2 3 0
probabilities: 0.000 0.167 0.333 0.500 0.000
left son=12 (3 obs) right son=13 (3 obs)
Primary splits:
an < 0.032 to the right, improve=2.333333, (0 missing)
and < 0.5665 to the left, improve=2.333333, (0 missing)
if. < 0.0175 to the left, improve=2.333333, (0 missing)
it < 0.17 to the left, improve=2.333333, (0 missing)
not < 0.0695 to the left, improve=2.333333, (0 missing)
Surrogate splits:
and < 0.5665 to the left, agree=1, adj=1, (0 split)
if. < 0.0175 to the left, agree=1, adj=1, (0 split)
it < 0.17 to the left, agree=1, adj=1, (0 split)
not < 0.0695 to the left, agree=1, adj=1, (0 split)
of < 0.7305 to the right, agree=1, adj=1, (0 split)
Node number 7: 10 observations
predicted class=Madison expected loss=0 P(node) =0.1960784
class counts: 0 0 0 0 10
probabilities: 0.000 0.000 0.000 0.000 1.000
Node number 12: 3 observations, complexity param=0.06666667
predicted class=HM expected loss=0.3333333 P(node) =0.05882353
class counts: 0 1 2 0 0
probabilities: 0.000 0.333 0.667 0.000 0.000
left son=24 (1 obs) right son=25 (2 obs)
Primary splits:
a < 0.279 to the right, improve=1.333333, (0 missing)
all < 0.0385 to the left, improve=1.333333, (0 missing)
an < 0.0485 to the right, improve=1.333333, (0 missing)
and < 0.4115 to the left, improve=1.333333, (0 missing)
any < 0.0215 to the right, improve=1.333333, (0 missing)
Node number 13: 3 observations
predicted class=Jay expected loss=0 P(node) =0.05882353
class counts: 0 0 0 3 0
probabilities: 0.000 0.000 0.000 1.000 0.000
Node number 24: 1 observations
predicted class=Hamilton expected loss=0 P(node) =0.01960784
class counts: 0 1 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 25: 2 observations
predicted class=HM expected loss=0 P(node) =0.03921569
class counts: 0 0 2 0 0
probabilities: 0.000 0.000 1.000 0.000 0.000
plotcp(rtree_5) # plot cross-validation results
printcp(rtree_5) # print cross-validation results (cp table)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 3, maxdepth = 5, minbucket = 1)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.40000 0.15339
3 0.133333 2 0.200000 0.53333 0.17314
4 0.066667 3 0.066667 0.60000 0.18150
5 0.000000 4 0.000000 0.53333 0.17314
# plot the decision tree
rpart.plot(rtree_5, main = "Classification Tree for fedPapers85", extra = 102)
rsq.rpart(rtree_5) # plot approximate R-squared and relative error for different splits (2 plots)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 3, maxdepth = 5, minbucket = 1)
Variables actually used in tree construction:
[1] a an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.600000 0 1.000000 1.00000 0.21693
2 0.200000 1 0.400000 0.40000 0.15339
3 0.133333 2 0.200000 0.53333 0.17314
4 0.066667 3 0.066667 0.60000 0.18150
5 0.000000 4 0.000000 0.53333 0.17314
Warning message: In rsq.rpart(rtree_5) : may not be applicable for this method
# grow tree with cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3)
rtree_10 <- rpart(author ~ ., data = train_data, method = 'class', cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3))
#summarize rtree values
summary(rtree_10)
Call:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3))
n= 51
CP nsplit rel error xerror xstd
1 0.6000000 0 1.00000000 1.0 0.2169305
2 0.2000000 1 0.40000000 0.4 0.1533930
3 0.1333333 2 0.20000000 0.4 0.1533930
4 0.0000000 3 0.06666667 0.4 0.1533930
Variable importance
upon an there and on to no which been every if. it not of
15 13 13 12 10 10 6 5 4 4 2 2 2 2
Node number 1: 51 observations, complexity param=0.6
predicted class=Hamilton expected loss=0.2941176 P(node) =1
class counts: 0 36 2 3 10
probabilities: 0.000 0.706 0.039 0.059 0.196
left son=2 (35 obs) right son=3 (16 obs)
Primary splits:
upon < 0.0145 to the right, improve=14.497550, (0 missing)
on < 0.0915 to the left, improve=12.450600, (0 missing)
there < 0.0145 to the right, improve=11.162020, (0 missing)
to < 0.499 to the right, improve= 9.435049, (0 missing)
of < 0.8705 to the right, improve= 4.936256, (0 missing)
Surrogate splits:
there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
on < 0.0915 to the left, agree=0.882, adj=0.625, (0 split)
to < 0.474 to the right, agree=0.882, adj=0.625, (0 split)
an < 0.064 to the right, agree=0.804, adj=0.375, (0 split)
and < 0.421 to the left, agree=0.804, adj=0.375, (0 split)
Node number 2: 35 observations
predicted class=Hamilton expected loss=0 P(node) =0.6862745
class counts: 0 35 0 0 0
probabilities: 0.000 1.000 0.000 0.000 0.000
Node number 3: 16 observations, complexity param=0.2
predicted class=Madison expected loss=0.375 P(node) =0.3137255
class counts: 0 1 2 3 10
probabilities: 0.000 0.062 0.125 0.188 0.625
left son=6 (6 obs) right son=7 (10 obs)
Primary splits:
no < 0.021 to the left, improve=5.208333, (0 missing)
an < 0.046 to the left, improve=4.656818, (0 missing)
which < 0.113 to the left, improve=4.256818, (0 missing)
of < 0.7305 to the left, improve=3.951923, (0 missing)
the < 1.019 to the left, improve=3.951923, (0 missing)
Surrogate splits:
an < 0.046 to the left, agree=0.938, adj=0.833, (0 split)
which < 0.113 to the left, agree=0.938, adj=0.833, (0 split)
and < 0.467 to the right, agree=0.875, adj=0.667, (0 split)
been < 0.0275 to the left, agree=0.875, adj=0.667, (0 split)
every < 0.0125 to the left, agree=0.875, adj=0.667, (0 split)
Node number 6: 6 observations, complexity param=0.1333333
predicted class=Jay expected loss=0.5 P(node) =0.1176471
class counts: 0 1 2 3 0
probabilities: 0.000 0.167 0.333 0.500 0.000
left son=12 (3 obs) right son=13 (3 obs)
Primary splits:
an < 0.032 to the right, improve=2.333333, (0 missing)
and < 0.5665 to the left, improve=2.333333, (0 missing)
if. < 0.0175 to the left, improve=2.333333, (0 missing)
it < 0.17 to the left, improve=2.333333, (0 missing)
not < 0.0695 to the left, improve=2.333333, (0 missing)
Surrogate splits:
and < 0.5665 to the left, agree=1, adj=1, (0 split)
if. < 0.0175 to the left, agree=1, adj=1, (0 split)
it < 0.17 to the left, agree=1, adj=1, (0 split)
not < 0.0695 to the left, agree=1, adj=1, (0 split)
of < 0.7305 to the right, agree=1, adj=1, (0 split)
Node number 7: 10 observations
predicted class=Madison expected loss=0 P(node) =0.1960784
class counts: 0 0 0 0 10
probabilities: 0.000 0.000 0.000 0.000 1.000
Node number 12: 3 observations
predicted class=HM expected loss=0.3333333 P(node) =0.05882353
class counts: 0 1 2 0 0
probabilities: 0.000 0.333 0.667 0.000 0.000
Node number 13: 3 observations
predicted class=Jay expected loss=0 P(node) =0.05882353
class counts: 0 0 0 3 0
probabilities: 0.000 0.000 0.000 1.000 0.000
plotcp(rtree_10) # plot cross-validation results
printcp(rtree_10) # print cross-validation results (cp table)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3))
Variables actually used in tree construction:
[1] an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.60000 0 1.000000 1.0 0.21693
2 0.20000 1 0.400000 0.4 0.15339
3 0.13333 2 0.200000 0.4 0.15339
4 0.00000 3 0.066667 0.4 0.15339
# plot the decision tree
rpart.plot(rtree_10, main = "Classification Tree for fedPapers85", extra = 102)
rsq.rpart(rtree_10) # plot approximate R-squared and relative error for different splits (2 plots)
Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class",
cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3))
Variables actually used in tree construction:
[1] an no upon
Root node error: 15/51 = 0.29412
n= 51
CP nsplit rel error xerror xstd
1 0.60000 0 1.000000 1.0 0.21693
2 0.20000 1 0.400000 0.4 0.15339
3 0.13333 2 0.200000 0.4 0.15339
4 0.00000 3 0.066667 0.4 0.15339
Warning message: In rsq.rpart(rtree_10) : may not be applicable for this method
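The models above vary minsplit, maxdepth, and minbucket one run at a time; the same comparison can be scripted over a parameter grid. A hedged sketch of that loop, again with the built-in iris data standing in for train_data:

```r
# Sketch: compare the best cross-validated error (min xerror) across a
# grid of rpart control settings. iris stands in for fedPapers85 data.
library(rpart)

set.seed(7)
grid <- expand.grid(minsplit = c(2, 3, 4), maxdepth = c(3, 5, 10))
grid$xerror <- apply(grid, 1, function(p) {
  fit <- rpart(Species ~ ., data = iris, method = "class",
               control = rpart.control(cp = 0,
                                       minsplit = p[["minsplit"]],
                                       maxdepth = p[["maxdepth"]]))
  min(fit$cptable[, "xerror"]) # best cross-validated error for this setting
})
grid[order(grid$xerror), ] # lowest-error settings first
```

Because xerror comes from random cross-validation folds, a set.seed call before the loop keeps the comparison reproducible.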
cat("\nArticles by Author:")
Articles by Author:
table(fedPapersDF_authors$author)
dispt Hamilton HM Jay Madison
0 51 3 5 15
cat("\nTrain_data - Articles by Author:")
Train_data - Articles by Author:
table(train_data$author)
dispt Hamilton HM Jay Madison
0 36 2 3 10
cat("\nTest_data - Articles by Author:")
Test_data - Articles by Author:
table(test_data$author)
dispt Hamilton HM Jay Madison
0 15 1 2 5
predict_unseen <- predict(rtree_10, test_data, type = 'class')
# predict_unseen
table_mat <- table(test_data$author, predict_unseen)
cat("\n\nPrediction results : Confusion Matrix \n\n")
Prediction results : Confusion Matrix
# table_mat
confusionMatrix(table_mat)
Confusion Matrix and Statistics
predict_unseen
dispt Hamilton HM Jay Madison
dispt 0 0 0 0 0
Hamilton 0 15 0 0 0
HM 0 0 0 0 1
Jay 0 0 0 1 1
Madison 0 1 1 0 3
Overall Statistics
Accuracy : 0.8261
95% CI : (0.6122, 0.9505)
No Information Rate : 0.6957
P-Value [Acc > NIR] : 0.1262
Kappa : 0.6475
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity NA 0.9375 0.00000 1.00000 0.6000
Specificity 1 1.0000 0.95455 0.95455 0.8889
Pos Pred Value NA 1.0000 0.00000 0.50000 0.6000
Neg Pred Value NA 0.8750 0.95455 1.00000 0.8889
Prevalence 0 0.6957 0.04348 0.04348 0.2174
Detection Rate 0 0.6522 0.00000 0.04348 0.1304
Detection Prevalence 0 0.6522 0.04348 0.08696 0.2174
Balanced Accuracy NA 0.9688 0.47727 0.97727 0.7444
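caret's confusionMatrix reports these statistics directly, but accuracy and kappa are easy to verify from the raw table. A small self-contained check on a hypothetical two-class matrix (the counts are illustrative, not the ones above):

```r
# Hypothetical confusion matrix: rows = actual, columns = predicted.
cm <- matrix(c(15, 2,
                1, 5),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("Hamilton", "Madison"),
                             predicted = c("Hamilton", "Madison")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                       # correct / total
expected <- sum(rowSums(cm) * colSums(cm)) / n^2    # chance agreement
kappa    <- (accuracy - expected) / (1 - expected)  # Cohen's kappa

round(c(accuracy = accuracy, kappa = kappa), 4)
```

Kappa discounts the agreement a classifier would get by chance, which is why it drops well below accuracy when one class (here Hamilton) dominates.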
# Section 3: Prediction | train data
predict_unseen <- predict(rtree_0, train_data, type = 'class')
# predict_unseen
table_mat <- table(train_data$author, predict_unseen)
cat("\n\nPrediction results : Confusion Matrix \n\n")
Prediction results : Confusion Matrix
# table_mat
confusionMatrix(table_mat)
Confusion Matrix and Statistics
predict_unseen
dispt Hamilton HM Jay Madison
dispt 0 0 0 0 0
Hamilton 0 36 0 0 0
HM 0 0 2 0 0
Jay 0 0 0 3 0
Madison 0 0 0 0 10
Overall Statistics
Accuracy : 1
95% CI : (0.9302, 1)
No Information Rate : 0.7059
P-Value [Acc > NIR] : 1.929e-08
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity NA 1.0000 1.00000 1.00000 1.0000
Specificity 1 1.0000 1.00000 1.00000 1.0000
Pos Pred Value NA 1.0000 1.00000 1.00000 1.0000
Neg Pred Value NA 1.0000 1.00000 1.00000 1.0000
Prevalence 0 0.7059 0.03922 0.05882 0.1961
Detection Rate 0 0.7059 0.03922 0.05882 0.1961
Detection Prevalence 0 0.7059 0.03922 0.05882 0.1961
Balanced Accuracy NA 1.0000 1.00000 1.00000 1.0000
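The perfect score above comes from evaluating rtree_0 on its own training rows, so it measures memorization rather than generalization; the honest estimate is the test-set matrix that follows. The gap is easy to reproduce on any dataset; a sketch with iris standing in for fedPapers85:

```r
# Sketch: an unpruned tree scores near-perfectly on its own training
# rows but typically lower on held-out rows.
library(rpart)

set.seed(123)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit <- rpart(Species ~ ., data = train, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))

acc <- function(d) mean(predict(fit, d, type = "class") == d$Species)
c(train_accuracy = acc(train), test_accuracy = acc(test))
```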
# Section 3: Prediction | Test Data
predict_DT0 <- predict(rtree_0, test_data, type = 'class')
# predict_unseen
table_DT0 <- table(test_data$author, predict_DT0)
cat("\n\nPrediction results : Confusion Matrix \n\n")
Prediction results : Confusion Matrix
# table_mat
confusionMatrix(table_DT0)
Confusion Matrix and Statistics
predict_DT0
dispt Hamilton HM Jay Madison
dispt 0 0 0 0 0
Hamilton 0 15 0 0 0
HM 0 0 0 0 1
Jay 0 0 0 1 1
Madison 0 2 0 0 3
Overall Statistics
Accuracy : 0.8261
95% CI : (0.6122, 0.9505)
No Information Rate : 0.7391
P-Value [Acc > NIR] : 0.2447
Kappa : 0.6275
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity NA 0.8824 NA 1.00000 0.6000
Specificity 1 1.0000 0.95652 0.95455 0.8889
Pos Pred Value NA 1.0000 NA 0.50000 0.6000
Neg Pred Value NA 0.7500 NA 1.00000 0.8889
Prevalence 0 0.7391 0.00000 0.04348 0.2174
Detection Rate 0 0.6522 0.00000 0.04348 0.1304
Detection Prevalence 0 0.6522 0.04348 0.08696 0.2174
Balanced Accuracy NA 0.9412 NA 0.97727 0.7444
# Section 3: Prediction | Test Data
predict_DT1 <- predict(rtree_1, test_data, type = 'class')
# predict_unseen
table_DT1 <- table(test_data$author, predict_DT1)
cat("\n\nPrediction results : Confusion Matrix \n\n")
Prediction results : Confusion Matrix
# table_mat
confusionMatrix(table_DT1)
Confusion Matrix and Statistics
predict_DT1
dispt Hamilton HM Jay Madison
dispt 0 0 0 0 0
Hamilton 0 15 0 0 0
HM 0 0 0 0 1
Jay 0 0 0 1 1
Madison 0 2 0 0 3
Overall Statistics
Accuracy : 0.8261
95% CI : (0.6122, 0.9505)
No Information Rate : 0.7391
P-Value [Acc > NIR] : 0.2447
Kappa : 0.6275
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity NA 0.8824 NA 1.00000 0.6000
Specificity 1 1.0000 0.95652 0.95455 0.8889
Pos Pred Value NA 1.0000 NA 0.50000 0.6000
Neg Pred Value NA 0.7500 NA 1.00000 0.8889
Prevalence 0 0.7391 0.00000 0.04348 0.2174
Detection Rate 0 0.6522 0.00000 0.04348 0.1304
Detection Prevalence 0 0.6522 0.04348 0.08696 0.2174
Balanced Accuracy NA 0.9412 NA 0.97727 0.7444
# Section 3: Prediction | Test Data
predict_DT2 <- predict(rtree_2, test_data, type = 'class')
# predict_unseen
table_DT2 <- table(test_data$author, predict_DT2)
cat("\n\nPrediction results : Confusion Matrix \n\n")
Prediction results : Confusion Matrix
# table_mat
confusionMatrix(table_DT2)
Confusion Matrix and Statistics
predict_DT2
dispt Hamilton HM Jay Madison
dispt 0 0 0 0 0
Hamilton 0 15 0 0 0
HM 0 0 0 0 1
Jay 0 0 0 1 1
Madison 0 2 0 0 3
Overall Statistics
Accuracy : 0.8261
95% CI : (0.6122, 0.9505)
No Information Rate : 0.7391
P-Value [Acc > NIR] : 0.2447
Kappa : 0.6275
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity NA 0.8824 NA 1.00000 0.6000
Specificity 1 1.0000 0.95652 0.95455 0.8889
Pos Pred Value NA 1.0000 NA 0.50000 0.6000
Neg Pred Value NA 0.7500 NA 1.00000 0.8889
Prevalence 0 0.7391 0.00000 0.04348 0.2174
Detection Rate 0 0.6522 0.00000 0.04348 0.1304
Detection Prevalence 0 0.6522 0.04348 0.08696 0.2174
Balanced Accuracy NA 0.9412 NA 0.97727 0.7444
# Section 3: Prediction | Test Data (rtree_5)
predict_DT5 <- predict(rtree_5, test_data, type = 'class')
# predict_unseen
table_DT5 <- table(test_data$author, predict_DT5)
cat("\n\nPrediction results : Confusion Matrix \n\n")
Prediction results : Confusion Matrix
# table_mat
confusionMatrix(table_DT5)
Confusion Matrix and Statistics
predict_DT5
dispt Hamilton HM Jay Madison
dispt 0 0 0 0 0
Hamilton 0 15 0 0 0
HM 0 0 0 0 1
Jay 0 0 0 1 1
Madison 0 2 0 0 3
Overall Statistics
Accuracy : 0.8261
95% CI : (0.6122, 0.9505)
No Information Rate : 0.7391
P-Value [Acc > NIR] : 0.2447
Kappa : 0.6275
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity NA 0.8824 NA 1.00000 0.6000
Specificity 1 1.0000 0.95652 0.95455 0.8889
Pos Pred Value NA 1.0000 NA 0.50000 0.6000
Neg Pred Value NA 0.7500 NA 1.00000 0.8889
Prevalence 0 0.7391 0.00000 0.04348 0.2174
Detection Rate 0 0.6522 0.00000 0.04348 0.1304
Detection Prevalence 0 0.6522 0.04348 0.08696 0.2174
Balanced Accuracy NA 0.9412 NA 0.97727 0.7444
# Section 3: Prediction | Test Data (rtree_10)
predict_DT10 <- predict(rtree_10, test_data, type = 'class')
# predict_unseen
table_DT10 <- table(test_data$author, predict_DT10)
cat("\n\nPrediction results : Confusion Matrix \n\n")
Prediction results : Confusion Matrix
# table_mat
confusionMatrix(table_DT10)
Confusion Matrix and Statistics
predict_DT10
dispt Hamilton HM Jay Madison
dispt 0 0 0 0 0
Hamilton 0 15 0 0 0
HM 0 0 0 0 1
Jay 0 0 0 1 1
Madison 0 1 1 0 3
Overall Statistics
Accuracy : 0.8261
95% CI : (0.6122, 0.9505)
No Information Rate : 0.6957
P-Value [Acc > NIR] : 0.1262
Kappa : 0.6475
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity NA 0.9375 0.00000 1.00000 0.6000
Specificity 1 1.0000 0.95455 0.95455 0.8889
Pos Pred Value NA 1.0000 0.00000 0.50000 0.6000
Neg Pred Value NA 0.8750 0.95455 1.00000 0.8889
Prevalence 0 0.6957 0.04348 0.04348 0.2174
Detection Rate 0 0.6522 0.00000 0.04348 0.1304
Detection Prevalence 0 0.6522 0.04348 0.08696 0.2174
Balanced Accuracy NA 0.9688 0.47727 0.97727 0.7444
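The per-class figures for rtree_10 can be cross-checked the same way. Note that caret's confusionMatrix on a table treats columns as the reference classes (since it assumes predictions are in rows), which is how the printed Sensitivity 0.9375 for Hamilton arises. A small sketch, with the matrix copied from the rtree_10 table:

```python
# Cross-check of caret's Hamilton metrics for the rtree_10 table.
# Columns are taken as the reference, matching caret's table convention.
cm = [
    [0, 0, 0, 0, 0],   # dispt
    [0, 15, 0, 0, 0],  # Hamilton
    [0, 0, 0, 0, 1],   # HM
    [0, 0, 0, 1, 1],   # Jay
    [0, 1, 1, 0, 3],   # Madison
]
n = sum(map(sum, cm))
h = 1                                   # index of the Hamilton class
tp = cm[h][h]
fp = sum(cm[h]) - tp                    # row Hamilton, reference other
fn = sum(r[h] for r in cm) - tp        # reference Hamilton, row other
tn = n - tp - fp - fn
sensitivity = tp / (tp + fn)            # 15/16
specificity = tn / (tn + fp)            # 7/7
print(round(sensitivity, 4), specificity)  # 0.9375 1.0
```

These reproduce the printed Sensitivity (0.9375) and Specificity (1.0000) for the Hamilton class.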
# Section 3: Prediction | Disputed Data
cat("\nDisputed Articles by Author:")
Disputed Articles by Author:
table(fedPapersDF_Dispt$author)
dispt Hamilton HM Jay Madison
11 0 0 0 0
predict_final <- predict(rtree_5, fedPapersDF_Dispt, type = 'class')
table_final <- table(fedPapersDF_Dispt$author, predict_final)
cat("\n\nPrediction results : \n\n")
Prediction results :
table_final
predict_final
dispt Hamilton HM Jay Madison
dispt 0 0 2 1 8
Hamilton 0 0 0 0 0
HM 0 0 0 0 0
Jay 0 0 0 0 0
Madison 0 0 0 0 0
predict_finaldf <- data.frame(predict_final)
cat("\n\nPrediction results by article : \n\n")
Prediction results by article :
View(predict_finaldf)
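Summarizing the table above: of the 11 disputed essays, rtree_5 assigns 8 to Madison, 2 to the joint HM class, and 1 to Jay. A minimal tally, with the counts copied from the prediction table:

```python
from collections import Counter

# Per-article authorship predictions for the 11 disputed essays,
# as tabled above (8 Madison, 2 HM, 1 Jay).
preds = ["Madison"] * 8 + ["HM"] * 2 + ["Jay"]
tally = Counter(preds)
share = {a: round(c / len(preds), 3) for a, c in tally.items()}
print(tally.most_common(1)[0])  # ('Madison', 8)
print(share["Madison"])         # 0.727
```

So roughly 73% of the disputed papers are attributed to Madison, consistent with the Madison-leaning conclusion from the clustering assignment.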
# Random Forest prediction of fedPapersDF1 data
EnsurePackage("randomForest")
# View(fedPapersDF1)
cat("\n All Articles by Author:")
All Articles by Author:
table(fedPapersDF$author)
dispt Hamilton HM Jay Madison
11 51 3 5 15
fit <- randomForest(y=fedPapersDF1$author, x=fedPapersDF1[2:ncol(fedPapersDF1)], data=fedPapersDF1, ntree=100, keep.forest=FALSE, importance=TRUE)
print(fit) # view results
Call:
randomForest(x = fedPapersDF1[2:ncol(fedPapersDF1)], y = fedPapersDF1$author, ntree = 100, importance = TRUE, keep.forest = FALSE, data = fedPapersDF1)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 8
OOB estimate of error rate: 20%
Confusion matrix:
dispt Hamilton HM Jay Madison class.error
dispt 3 3 0 0 5 0.7272727
Hamilton 0 51 0 0 0 0.0000000
HM 1 0 0 0 2 1.0000000
Jay 2 1 0 1 1 0.8000000
Madison 1 1 0 0 13 0.1333333
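The 20% OOB estimate can be recovered from the OOB confusion matrix: it is simply the off-diagonal fraction. A cross-check with the counts copied from the matrix above:

```python
# OOB error rate recomputed from the random forest confusion matrix.
cm = [
    [3, 3, 0, 0, 5],    # dispt
    [0, 51, 0, 0, 0],   # Hamilton
    [1, 0, 0, 0, 2],    # HM
    [2, 1, 0, 1, 1],    # Jay
    [1, 1, 0, 0, 13],   # Madison
]
n = sum(map(sum, cm))                           # 85 essays in total
errors = n - sum(cm[i][i] for i in range(5))    # off-diagonal counts
print(errors / n)  # 0.2 -> the reported 20% OOB error
```

Note that the forest "misclassifies" most of the dispt essays as Madison (5 of 11), which is expected: dispt is not a real authorship class, and those votes again point toward Madison.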
importance(fit) # importance of each predictor
dispt Hamilton HM Jay Madison MeanDecreaseAccuracy MeanDecreaseGini
a 7.295406e-01 1.31389708 1.4285714 0.0000000 0.4968289 1.85552028 0.65119316
all 0.000000e+00 0.44985090 0.0000000 1.0050378 1.0050378 1.04802422 0.32462481
also 8.047741e-01 -0.28408957 0.0000000 1.3538810 1.4310476 1.16121998 1.08802976
an 1.823067e+00 3.10344808 0.0000000 1.9483241 1.9834264 3.92437078 1.53432189
and 1.942938e-01 2.75525053 0.0000000 3.0151134 1.5900315 3.52183438 2.30345199
any -1.690309e+00 1.00618490 0.0000000 1.0050378 1.0665287 0.73366470 0.93553176
are 1.158490e+00 -1.00503782 0.0000000 -1.0050378 0.0000000 -0.01679945 0.26907343
as 1.688027e+00 -0.10606781 1.3538810 0.0000000 -0.5598318 1.50888983 0.65918280
at 1.005038e+00 1.18390806 0.0000000 0.0000000 -1.1280610 1.11018891 0.44831523
be 2.204007e+00 -0.05825098 1.0050378 0.0000000 1.0043544 2.15640466 0.76631605
been 1.060057e-01 -0.77793147 1.7586311 0.0000000 -0.6137462 0.02940495 0.90546440
but -1.005038e+00 -0.86225174 0.0000000 0.0000000 -0.6033810 -1.25447538 0.24152540
by 2.982350e+00 3.00931669 0.0000000 -1.1915865 1.4249075 3.32753972 2.35737101
can -5.579525e-01 1.42104221 0.0000000 -1.0050378 0.3630434 0.88289566 0.53384633
do -1.005038e+00 -1.00503782 0.0000000 0.0000000 1.4242781 -0.02244506 0.26216153
down 0.000000e+00 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000
even -5.783149e-01 0.00000000 0.0000000 0.0000000 -1.7541160 -1.58671486 0.28950924
every 1.493480e+00 1.43055953 1.4285714 0.0000000 1.2761821 2.60576627 1.14529592
for. 0.000000e+00 1.00503782 0.0000000 0.0000000 -1.0050378 -0.02318078 0.20101329
from -3.431991e-01 -0.31074635 0.0000000 -1.4002801 -0.5509390 -1.75441798 0.35806619
had 2.000400e-01 1.00503782 0.0000000 0.0000000 -1.0050378 0.30188737 0.22509336
has 2.327930e+00 0.03553427 0.0000000 -1.5249857 0.4816182 1.38576798 0.98691512
have 1.366722e+00 -0.42620706 0.0000000 1.0050378 0.3888079 0.88268208 0.40141251
her -1.005038e+00 1.00503782 0.0000000 0.0000000 -1.0050378 -1.00503782 0.15280952
his -1.280474e-01 1.00503782 0.0000000 0.0000000 0.4690364 1.05980251 0.23800000
if. -8.409316e-01 0.06645958 0.0000000 0.0000000 1.0050378 -0.26595316 0.31971927
in. 3.345124e-01 1.01717357 1.0050378 -1.4002801 0.3872553 0.69230284 1.34079432
into 1.005038e+00 0.88379737 0.0000000 1.0050378 -1.0050378 0.96069733 0.30674774
is 2.627035e-01 1.00503782 1.0050378 1.0050378 -0.8322688 0.49841441 0.56641393
it -7.942911e-01 -1.00503782 0.0000000 0.0000000 -1.2265140 -1.40474512 0.26183009
its 1.005038e+00 0.08304834 -1.0050378 -1.0050378 1.9137626 1.78220139 0.54259654
may 0.000000e+00 1.00503782 0.0000000 -1.0050378 -0.1561928 0.09944180 0.27158956
more -2.325581e-01 0.24525400 0.0000000 0.0000000 -0.1084716 0.25574559 0.31549911
must 0.000000e+00 1.00503782 1.0050378 0.0000000 1.4242781 1.65078685 0.42001885
my 0.000000e+00 1.00503782 0.0000000 0.0000000 -1.7216897 -0.44743039 0.09682012
no 9.615735e-03 0.07369462 -0.4476615 -1.0050378 0.9879329 0.38198970 0.84016672
not 1.151825e+00 -1.03196684 1.3538810 -0.2774568 -1.3244845 -0.07093695 0.60606044
now 0.000000e+00 -1.00503782 0.0000000 0.0000000 0.0000000 -1.00503782 0.13929625
of -9.921184e-01 0.68021873 0.0000000 2.4722853 0.8410479 1.53280125 1.28579447
on 1.451258e+00 4.36342544 -1.0050378 -1.5911978 2.1188560 3.81586034 2.79284944
one 4.267896e-01 1.42427806 -1.0050378 1.0050378 1.0050378 1.09430599 0.52074431
only -1.005038e+00 -0.12775114 0.0000000 0.4476615 1.0050378 -0.07056397 0.52523582
or 0.000000e+00 1.00503782 0.0000000 0.0000000 -0.7558156 0.11797550 0.28375323
our 0.000000e+00 0.06501857 0.0000000 0.0000000 -0.1061236 -0.36920322 0.35557975
shall 1.749453e+00 0.65901824 0.0000000 -1.0050378 0.8675606 1.60240930 0.57966806
should -1.413925e+00 0.66654754 1.0050378 0.0000000 0.1289908 0.20633552 0.50907639
so 0.000000e+00 0.64457274 0.0000000 0.0000000 0.0000000 0.74421972 0.33291210
some 1.555910e+00 -0.05058133 0.0000000 0.0000000 1.2803688 1.71918265 0.44643757
such -1.005038e+00 1.41785354 0.0000000 0.0000000 0.0000000 0.63279500 0.26098388
than 0.000000e+00 -1.42824159 1.0050378 0.0000000 -1.4242781 -1.57908883 0.38626672
that 4.476615e-01 -1.00503782 1.0050378 0.0000000 1.7227589 1.01225856 0.23127042
the 0.000000e+00 -0.09407625 0.0000000 2.1568925 0.8928348 1.83191730 1.20504815
their 1.513820e+00 0.48468813 -1.0050378 1.3538810 -0.2935817 2.12720053 0.64341943
then 0.000000e+00 -1.35388105 -1.0050378 0.0000000 0.0000000 -1.64240316 0.16344570
there 3.550505e-17 3.84019564 1.3538810 -0.2294761 3.7006005 4.05865516 2.86178759
things 0.000000e+00 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.01000000
this 0.000000e+00 -1.00503782 0.0000000 1.0050378 1.0050378 0.02175461 0.45741485
to -8.608709e-01 3.35314254 1.0050378 0.3335187 1.4082016 3.98456227 2.47379188
up -1.005038e+00 1.00503782 0.0000000 0.0000000 0.2722664 0.57900475 0.18023179
upon 4.665254e+00 7.25278888 1.0050378 2.7522860 4.3080837 7.62654635 5.66827478
was -8.491993e-01 0.43238897 1.0050378 -1.4139250 2.4075514 1.13102028 0.70107848
were -1.280474e-01 -1.00503782 0.0000000 0.0000000 0.5490802 -0.61172508 0.34497031
what 1.005038e+00 -1.00503782 0.0000000 0.0000000 0.2461646 -0.01845473 0.21280474
when 1.740777e+00 1.42857143 -1.0050378 -1.0050378 0.0000000 1.25400593 0.34647203
which 0.000000e+00 1.75809244 0.0000000 1.4002801 1.7309693 2.21774005 0.49400968
who -1.005038e+00 1.00503782 0.0000000 0.0000000 0.0000000 0.48647227 0.24923787
will -7.464487e-01 -0.02640373 0.0000000 1.0050378 1.1584896 0.02913939 0.65786191
with -1.005038e+00 -1.74729192 0.0000000 1.0050378 -0.4726493 -1.58066626 0.28930846
would 7.101010e-17 2.05129105 0.0000000 -0.2774568 1.4242781 1.51921803 0.87874498
your 0.000000e+00 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.08685936
rf_importance <- data.frame(importance(fit)) # importance of each predictor
rf_importance
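The importance table is easier to interpret sorted. A hedged re-sort of a few MeanDecreaseGini values copied from the table above (not an exhaustive ranking) surfaces the strongest discriminators:

```python
# Top function-word predictors by MeanDecreaseGini, values copied
# from the importance(fit) output above.
gini = {"upon": 5.66827478, "there": 2.86178759, "on": 2.79284944,
        "to": 2.47379188, "by": 2.35737101, "and": 2.30345199}
top = sorted(gini, key=gini.get, reverse=True)
print(top[:3])  # ['upon', 'there', 'on']
```

"upon" dominates by a wide margin, which agrees with the classic Mosteller-Wallace finding that Hamilton used "upon" far more often than Madison.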
# Random Forest prediction of fedPapersDF1 data (second run; no seed is
# set, so the OOB error and importance scores vary between runs)
fit <- randomForest(y=fedPapersDF1$author, x=fedPapersDF1[2:ncol(fedPapersDF1)], data=fedPapersDF1, ntree=100, keep.forest=FALSE, importance=TRUE)
print(fit) # view results
Call:
randomForest(x = fedPapersDF1[2:ncol(fedPapersDF1)], y = fedPapersDF1$author, ntree = 100, importance = TRUE, keep.forest = FALSE, data = fedPapersDF1)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 8
OOB estimate of error rate: 22.35%
Confusion matrix:
dispt Hamilton HM Jay Madison class.error
dispt 2 2 0 0 7 0.8181818
Hamilton 0 51 0 0 0 0.0000000
HM 0 0 0 1 2 1.0000000
Jay 1 0 0 3 1 0.4000000
Madison 1 4 0 0 10 0.3333333
importance(fit) # importance of each predictor
dispt Hamilton HM Jay Madison MeanDecreaseAccuracy MeanDecreaseGini
a 0.8141880 2.40158236 1.6903085 1.7586311 0.52979064 2.723332456 1.074961499
all -1.4285714 1.40028008 0.0000000 0.0000000 1.00503782 0.640097047 0.128500000
also -0.6637233 -0.27553536 0.0000000 1.6903085 -1.26690431 -0.326233046 0.936142162
an 2.0108422 2.01342693 1.0050378 3.1094985 1.05814822 3.120989957 2.021625066
and -1.3109902 3.01901653 1.3538810 3.0454787 -0.83891019 3.611830368 1.939410126
any -0.2000400 1.55060537 -1.0050378 -1.0050378 -0.95836918 0.931891409 0.558205327
are 0.0000000 -1.03381961 0.0000000 -1.0050378 0.13442849 -0.504247075 0.382612923
as 0.2774568 0.35794034 1.4285714 0.0000000 0.57831493 0.864461770 0.609635689
at -0.1533750 1.73863481 0.0000000 0.0000000 1.49036823 1.756078942 0.629939790
be 1.0050378 1.42380375 2.6926023 -1.0050378 0.86929312 2.415476889 0.831813979
been -0.3360069 0.33261246 1.0050378 0.9082573 0.99647476 1.114090450 1.090647995
but -0.2000400 1.50066897 0.0000000 1.0050378 0.08671426 1.422964691 0.440945151
by 0.0407573 2.04283057 0.0000000 0.6415003 0.13117478 1.891153699 1.652448186
can 0.0000000 -1.42857143 0.0000000 0.0000000 1.00503782 -0.295172222 0.399286617
do 0.0000000 -1.00503782 -1.0050378 0.0000000 0.00000000 -1.428362091 0.136002540
down -1.0050378 -1.00503782 0.0000000 0.0000000 -1.00503782 -1.381317263 0.091571429
even 0.0000000 -1.42324644 0.0000000 0.0000000 -1.35388105 -1.258235639 0.159750000
every 0.0000000 1.54320180 1.0050378 1.0050378 -0.02794308 1.706150428 0.760287372
for. 0.0000000 -1.10944767 1.0050378 0.0000000 -0.79255609 -1.426556688 0.336744298
from 1.7015962 -0.05137261 0.0000000 0.0000000 1.42857143 1.202124003 0.655468678
had 0.0000000 0.00000000 0.0000000 1.0050378 -1.00503782 0.000000000 0.165669448
has 2.9499150 1.67777148 -1.7586311 -0.3335187 2.46510026 2.959424880 1.044080558
have 0.9581896 -0.12396938 0.0000000 0.0000000 0.89842912 0.778576015 0.727342007
her -1.0050378 -0.14806287 0.0000000 0.0000000 -1.00503782 -1.428312025 0.122704798
his 0.0000000 -0.68728294 0.0000000 0.0000000 1.24580153 0.369637842 0.217884712
if. -1.3538810 0.00000000 0.0000000 1.3538810 -1.42857143 -0.670831227 0.437049394
in. 1.3538810 0.85262077 1.4285714 1.0050378 0.21989985 1.219069134 0.775348077
into -0.4476615 -1.12960533 0.0000000 -0.6876142 -1.00503782 -1.051314013 0.298076565
is 0.0000000 0.00000000 -1.4285714 1.5456644 0.30164849 0.933479188 0.315722222
it -1.0050378 0.00000000 -1.0050378 -1.0050378 -1.00503782 -1.655142963 0.181166763
its -0.5862104 1.30354589 0.0000000 0.0000000 0.02796735 0.668128833 0.498337173
may -1.5862879 0.00000000 1.0050378 -1.0050378 1.34096493 -0.026517067 0.484785856
more 1.0050378 0.00000000 0.0000000 1.4285714 -1.00503782 1.073176089 0.276178880
must 1.0050378 1.00503782 0.0000000 0.0000000 -1.00503782 0.566037526 0.160461538
my 0.0000000 0.00000000 0.0000000 0.0000000 1.00503782 1.005037815 0.013333333
no 0.3181603 -2.01353792 -0.3335187 0.0000000 2.42139391 0.469212830 0.790700250
not -0.3145027 1.42573279 1.0050378 1.4002801 0.29050934 1.106838732 0.808509656
now 1.7494534 -1.00503782 -1.0050378 0.0000000 -0.77693097 -0.409668274 0.293514508
of 0.1280474 2.73665158 0.0000000 3.1448545 1.23390468 3.354598823 1.711806594
on 3.0266908 4.36819343 0.4476615 -1.5911978 2.49389760 4.911140768 2.647717386
one 0.1561928 1.29262860 0.0000000 1.0050378 0.00000000 1.307375739 0.694572602
only 1.4285714 0.69548318 0.0000000 0.0000000 -1.32453236 0.480546683 0.467899952
or -1.4002801 -0.82119764 -1.0050378 -1.0050378 0.33989333 -0.876655669 0.669918081
our -1.4196573 1.42056323 0.0000000 1.4002801 -1.00503782 0.005730474 0.211757802
shall 1.4196573 1.00503782 0.0000000 0.0000000 -0.63765607 0.611309253 0.295124657
should 0.5783149 0.99640468 0.0000000 0.0000000 0.05414071 0.820640659 0.511198834
so 0.0000000 0.00000000 0.0000000 0.0000000 -0.20004001 -0.136235624 0.143777778
some 1.2751534 0.69109766 0.0000000 0.0000000 0.00000000 1.253351114 0.288539119
such 0.0000000 1.59992942 0.0000000 0.0000000 -1.35388105 1.047411699 0.188615348
than -0.8559210 0.00000000 0.0000000 0.0000000 -0.05494227 -0.608422595 0.249236274
that 0.2182699 0.39949082 1.0050378 0.0000000 1.65493073 1.873838035 0.607429354
the 1.3538810 0.42119132 0.0000000 1.5294382 0.87574550 1.428560612 0.952728290
their 1.4285714 0.57528417 1.0050378 0.0000000 -0.78741686 0.862054063 0.400457686
then 0.0000000 1.00503782 0.0000000 0.0000000 0.00000000 1.005037815 0.189033993
there -1.1020775 3.55783505 0.0000000 0.4476615 4.09896440 4.033428132 2.663423073
things -1.0050378 0.00000000 0.0000000 0.0000000 0.00000000 -1.005037815 0.066515152
this -1.0050378 1.33160102 0.0000000 0.0000000 1.69725026 1.642999201 0.320695975
to -0.7699206 3.25289118 1.0050378 0.0000000 2.85686860 3.155368935 3.058179318
up 0.0000000 0.00000000 1.0050378 0.0000000 0.00000000 1.005037815 0.125099778
upon 3.4574111 6.99757420 1.3538810 2.1882490 5.40596069 7.139375344 6.345506204
was -1.1144180 -1.47463897 1.0050378 0.0000000 -1.00503782 -1.519821970 0.466117037
were -0.3431991 1.09964909 0.0000000 0.0000000 1.00503782 1.436174623 0.506280580
what 0.0000000 1.00503782 0.0000000 0.0000000 0.00000000 1.005037815 0.111764706
when 1.4087457 -1.00503782 0.0000000 -1.0050378 0.42237411 0.113766554 0.248016106
which -0.1561928 1.38063475 -1.0050378 1.9660901 1.74945338 2.480267695 0.660858240
who -0.2000400 -0.87650722 0.0000000 0.0000000 0.00000000 -0.739704410 0.155808537
will -0.9642111 1.30182796 0.0000000 0.0000000 0.30164849 1.005605556 0.731115521
with -1.4285714 0.11565542 0.0000000 0.0000000 1.00503782 -0.442704303 0.323980357
would 1.3388584 1.53456170 1.0050378 -0.2182699 0.34492768 1.920871207 0.765781222
your 0.0000000 -1.00503782 0.0000000 0.0000000 0.00000000 -1.005037815 0.008768116